Abstract



Boosting methods combine a set of moderately accurate weak learners to form a highly
accurate predictor. Despite the practical importance of multi-class boosting, it has received
far less attention than its binary counterpart. In this work, we propose a fully-corrective
multi-class boosting formulation which directly solves the multi-class problem without
dividing it into multiple binary classification problems. In contrast, most previous multi-class
boosting algorithms decompose a multi-class boosting problem into multiple binary boosting
problems. By explicitly deriving the Lagrange dual of the primal optimization problem, we are
able to construct a column generation-based fully-corrective approach to boosting which
directly optimizes multi-class classification performance. The new approach not only
updates all weak learners' coefficients at every iteration, but does so in a manner flexible
enough to accommodate various loss functions and regularizations. For example, it enables
us to introduce structural sparsity through mixed-norm regularization to promote group
sparsity and feature sharing. Boosting with shared features is particularly beneficial in
complex prediction problems where features can be expensive to compute. Our experiments
on various data sets demonstrate that our direct multi-class boosting generalizes as
well as, or better than, a range of competing multi-class boosting methods. The end result
is a highly effective and compact ensemble classifier which can be trained in a distributed
fashion.

Keywords: multi-class boosting, Lagrange duality, column generation, convex optimization,
distributed optimization, alternating direction methods



1. Introduction 



A significant proportion of the most important practical classification problems inherently
involve making a selection between a large number of classes. Such problems demand
effective and efficient multi-class classification techniques. Unlike binary classification, which
has been well researched, multi-class classification has received relatively little attention
due to the inherent complexity of the problem. Some important steps have been taken (see Wu
et al. (2004); Crammer and Singer (2001); Guruswami and Sahai (1999) for instance), but
the primary approach thus far has exploited large numbers of independent binary classifiers.
An example of this approach is the extension of a binary classification algorithm to
the multi-class case by considering the problem as a set of one-vs-all binary classification
problems.

Boosting has recently attracted much research interest in many scientific fields due to
its huge success in classification and regression tasks, especially in the first real-time face
detection application (Viola and Jones, 2004). Both theoretical and empirical results show
that boosting methods have competitive generalization performance compared with many
existing classifiers in the literature. To explain why boosting works, Schapire et al. (1998)
introduced an appropriate margin theory, which was inspired by the margin theory in support
vector machines, and concluded that boosting is also an effective classifier which maximizes
the minimum margin over the training data. Extending this idea, LPBoost (Demiriz
et al., 2002) seeks to maximize the relaxed minimum margin (soft margin) using hinge loss.



The proposed boosting algorithm is fully corrective in the sense that all the coefficients 
of learned weak classifiers are updated at each iteration. Such fully-corrective boosting 
algorithms typically require fewer iterations to achieve convergence. 

Despite the significant attention that boosting-based binary classification methods have
attracted, multi-class boosting has been much less well studied. As with multi-class
classification in general, the most natural strategy for multi-class boosting is to partition the
problem into a set of independent binary classification problems. In this scenario each
binary classifier is charged with distinguishing a subset of the classes against all others.
Methods such as one-vs-all, all-vs-all and output code-based methods belong to this category.
Although such partitioning strategies greatly simplify the problem, they inevitably
impact upon the final solution. In many cases the partitioning strategy changes the cost
function to be optimized, and thus delivers a sub-optimal solution. The all-vs-all approach,
a.k.a. one-vs-one, however, has been shown to achieve excellent classification accuracy. In
this approach $k(k-1)/2$ two-way pairwise classifiers are trained, with $k$ the number of
classes. The computation of both training and testing can be prohibitively expensive even
when $k$ is of medium size. More importantly, however, almost none of these strategies
directly optimizes the multi-class decision function that they seek to exploit.

In this work, we proffer a direct approach to fully-corrective multi-class boosting. In
order to achieve this result, we generalize the concept of the separating hyperplane and
margin in binary boosting to multi-class problems. This allows the development of a single,
fully-corrective, multi-class boosting classifier which directly optimizes multi-class
classification performance. Similar ideas have been used in multi-class support vector machines
(Crammer and Singer, 2001; Weston and Watkins, 1999; Elisseeff and Weston, 2001). To
our knowledge, this idea has not been employed to design fully-corrective multi-class boosting.
As shown in (Shen and Li, 2010), fully-corrective boosting in general leads to more compact
models. Here, for the first time, we develop fully-corrective multi-class boosting.

In deriving our direct formulation we also generalize the fully-corrective $\ell_1$-regularized
boosting algorithms to arbitrary mixed-norm regularization terms. Mixed-norm regularization,
also known as group sparsity, has been used when there exists a structure that
separates the model into disjoint groups of parameters. For $\ell_{1,2}$-norm regularized boosting,
for example, each such group of parameters is subject to a common $\ell_2$-norm regularizer. The
key intuition behind structural sparsity is that informative features are commonly shared
between multiple classes. For example, traffic warning signs have a common triangular
shape with various symbols inside. These basic shared features should be used to help
differentiate warning signs from other traffic signs, while the symbols inside can be used
to differentiate different warning signs. In this work, we aim to enable the selection of a
common subset of features which are informative in identifying a wide range of classes.

The key idea behind our column generation-based boosting approach is that, given an
example $x$ with true label $y$, the output of the decision function for the correct label must
be larger than the output of the decision function for all incorrect labels,

$$F_y(x) > F_r(x), \quad \forall r \neq y.$$

We then formulate a convex optimization problem, which maximizes $F_y(x) - F_r(x)$ subject
to the selected regularization term. This leads to a constrained semi-infinite convex
optimization problem, which may have infinitely many variables. In order to design a boosting
algorithm, we explicitly derive the Lagrange dual of this problem and apply an iterative
convex optimization technique known as column generation. When the hinge loss is used,
our formulation can be viewed as a direct extension of LPBoost (Demiriz et al., 2002) to
the multi-class case. We also discuss the use of the exponential and logistic loss functions.
In theory, any convex loss function can be employed, as in the binary classification case.
Note that the AnyBoost framework of Mason et al. (2000) cannot be adopted here since
AnyBoost cannot cope with multiple constraints. In summary, our main contributions are
as follows.

• We propose the first direct approach to fully-corrective multi-class boosting based on
the generalization of the conventional "margin" in binary classification.

• Within this direct, fully-corrective boosting framework, we design new boosting methods
that promote feature sharing across classes by enforcing group sparsity regularization
(referred to as MultiBoost$^{\text{group}}$). We empirically show that by enforcing group
sparsity, the proposed multi-class boosting converges faster while achieving better or
comparable generalization performance. The fact that the algorithm converges fast
means that fewer features are required for a given classification accuracy and there
is a significant improvement in run-time performance. Our derivation for designing
multi-class boosting methods is applicable to arbitrary convex loss functions with
general $\ell_{1,p}$ ($p > 1$) mixed-norms. To our knowledge, this is the first fully-corrective
multi-class boosting approach that promotes feature sharing using group sparsity
regularization. Moreover, we propose the use of the alternating direction method of
multipliers (ADMM) (Boyd et al., 2011) to efficiently solve the involved optimization
problems, which is much faster than using standard interior-point solvers.

• Further, a new family of multi-class boosting algorithms based on a simplified
formulation is proposed in order to further reduce training times. This new formulation not
only enables us to share features and encourages structural sparsity in the learning
procedure of multi-class boosting, but also allows us to take advantage of parallelism
in ADMM to speed up the training time by a factor proportional to the number of
classes. The training time required is thus similar to that required to train multiple
independent binary classifiers in parallel. The proposed formulation converges
significantly faster, while still enforcing group sparsity.

Since multi-class classification can be seen as an instance of the structured learning problems
of Tsochantaridis et al. (2005), the proposed formulation may also be applicable to other
structured prediction problems.

We briefly review the most relevant work on multi-class boosting before we present our
algorithms.



1.1 Related work 



AdaBoost, proposed in (Freund and Schapire, 1997), was the first practical binary boosting
algorithm. One of the limitations of binary AdaBoost is that each weak classifier's accuracy
must be higher than 0.5. That is, a weak classifier must exhibit classification capability
superior to that of random guessing. AdaBoost.M1 directly extended AdaBoost to multi-class
classification using multi-class weak classifiers. Multi-class weak classifiers, such as
decision trees for example, represent a restricted set of weak classifiers able to give
predictions on all $k$ possible labels at each call. The fact that only multi-class weak classifiers can
be used represents a significant restriction, as multi-class weak classifiers are complicated
and require time-consuming training when compared with their simple binary counterparts.
The higher complexity of the assembled classifier also implies a higher risk of over-fitting
the training data. In addition, the requirement that a weak classifier's weighted error must
be better than 0.5 can be hard to achieve for problems with many classes. Note that, for a
problem with $k$ classes, random guessing can only guarantee an accuracy of $1/k$.



The SAMME algorithm of Zhu et al. (2009) addressed this last issue, and requires
only that the multi-class weak classifiers achieve an error rate better than uniform random
guessing for multiple labels ($1/k$ for $k$ labels). When $k = 2$, SAMME reduces to the
standard AdaBoost, but is still subject to all of the other limitations associated with the
use of multi-class weak classifiers.

To alleviate these difficulties, one solution is to decompose a multi-class boosting problem
into a set of binary classification problems. To this end, strategies such as "one-vs-all"
and "one-vs-one" have been developed. Such approaches can be viewed as special
cases of error-correcting output coding (ECOC) (Dietterich and Bakiri, 1995; Crammer
and Singer). By introducing a coding matrix, AdaBoost.MO (Schapire and Singer,
1999) is a typical example of ECOC-based multi-class boosting. In this approach a set
of binary classifiers is used, with each trained so as to recognise a subset of the classes.
Multi-class classification is achieved by comparing the responses of all of the binary
classifiers. Algorithms in this category include AdaBoost.MO (Schapire and Singer, 1999),
AdaBoost.OC and AdaBoost.ECC (Guruswami and Sahai, 1999). AdaBoost.OC can be
seen as a variant of AdaBoost.MO which also combines boosting and ECOC. However,
unlike AdaBoost.MO, AdaBoost.OC uses a collection of randomly generated codewords. For
more details see (Schapire, 1997).

The attraction of transforming a multi-class classification problem into a set of binary
classification problems is that each of the weak classifiers need only be a simple binary
classifier. This approach has its limitations, however, including the fact that the required
optimisation problem is typically compromised by the partition, and that it becomes
increasingly difficult to ensure that each binary classifier sees a representative sample of the
data as the number of classes increases. An additional limitation of all partitioning
algorithms we have discussed is that they are incapable of effectively exploiting the inevitable
similarity between classes, and thus of efficiently sharing features between classifiers. Since
binary classifiers are trained independently, the resulting strong classifier can be highly
unbalanced and often dependent on an excessive number of features/weak classifiers.

Several approaches have been developed which aim to enable feature-sharing within
multi-class boosting. JointBoost, proposed by Torralba et al. (2007), finds common features
that can be shared across classes using heuristics. Weak learners are then trained jointly
using standard boosting. In order to reduce the number of binary classifiers which need to
be trained for multi-class problems, the authors proposed an approximate search procedure
based on greedy forward selection. The drawback of the greedy approach, however, is that it is
short-sighted and cannot recover if an error is made. The fact that the weak learner selected
at each boosting iteration cannot be guaranteed to be globally optimal means that the final
ensemble is highly likely to be sub-optimal. Zhang et al. proposed training multi-class
boosting with sharable information patterns (Zhang et al., 2009). As a pre-processing step,
they generate sharable patterns using data mining techniques and then train a multi-class
boosting-based classifier using these patterns. The process of identifying sharable features
and the training procedure are thus de-coupled, and therefore unlikely to reach the optimal
solution. In comparison to JointBoost and Zhang et al.'s work, the method we propose
selects weak learners systematically on the basis of structural sparsity during the training
process and thus, at least asymptotically, will reach the globally optimal solution.

A related approach, termed GradBoost (Duchi and Singer 2009), also exploits a mixed- 
norm in order to achieve group sparsity, but does not directly optimize the boosting objective 
function. Instead, the algorithm updates a block of variables for optimizing a quadratic 
surrogate function in a fashion similar to gradient-based coordinate descent. It is not clear 
how well the surrogate approximates the original objective function, and no proof is given. 
Since the mixed-norm regularization term is not directly optimized either, group sparsity 
is achieved heuristically by a combination of forward selection and backward elimination. 
Our work fundamentally differs from (Duchi and Singer, 2009) in that we directly optimize 
the group sparsity regularized objective by following the column generation based boosting 
(Shen and Li, 2010) without deferring to heuristics. 

Our work here can also be seen as an extension of the general binary fully-corrective
boosting framework of Shen and Li (2010) to the multi-class case. As in (Shen and Li,
2010), we design a feature-sharing boosting method using a direct formulation, but for
multi-class problems and using a more sophisticated group sparsity regularization. Note
that the general boosting framework of Shen and Li (2010) is not directly applicable in our
problem setting.



1.2 Notation 

A bold lowercase letter ($\mathbf{u}$) denotes a column vector, and an uppercase letter ($U$) a matrix.
$\mathrm{Tr}(U)$ represents the trace of a symmetric matrix. An element-wise inequality between two
vectors or matrices such as $\mathbf{u} \ge \mathbf{v}$ implies that $u_i \ge v_i$ for all $i$.

Let $(x_i, y_i) \in \mathbb{R}^d \times \{1, \dots, k\}$, $i = 1, \dots, m$, be a set of $m$ multi-class training examples,
where $k$ denotes the number of classes. We denote by $\mathcal{H}$ a set of weak classifiers (or
dictionary); note that the size of $\mathcal{H}$ can be infinite. Each $h_j(\cdot) \in \mathcal{H}$, $j = 1, \dots, n$, is a function
that maps an input $x$ to $\{-1, +1\}$. Although our discussion applies equally in the general
case where $h(\cdot)$ may take any real value, we use binary weak classifiers in this work. The
matrix $H \in \mathbb{R}^{m \times n}$ captures the weak classifiers' responses to the whole of the training data;
that is, $H_{ij} = h_j(x_i)$. Each column $H_{:j}$ thus represents the output of the weak classifier
$h_j(\cdot)$ when applied to the entire training set, and each row $H_{i:}$ the responses of all of the
weak classifiers to the $i$th training datum $x_i$.

Boosting algorithms learn a strong classifier of the form $F(x) = \sum_{j=1}^{n} w_j h_j(x)$ which is
parameterized by a vector $w \in \mathbb{R}^n$. In our formulation of the problem we need to learn a
classifier for each class. So for class $r$ (where $r = 1, \dots, k$), the learned strong classifier is
$F_r(x) = \sum_{j=1}^{n} w_{r,j} h_j(x)$ and has parameter vector $w_r$. We define $W = [w_1, w_2, \dots, w_k] \in
\mathbb{R}^{n \times k}$, so that $W_{jr} = w_{r,j}$, and let $\|W\|_1 = \sum_{j,r} |W_{jr}|$ represent the $\ell_1$ norm. The
$\ell_{1,2}$ norm of a matrix is defined as $\|W\|_{1,2} = \sum_j \|W_{j:}\|_2$ with $\|\cdot\|_2$ being the $\ell_2$ norm.
The $\ell_{1,\infty}$ norm of $W$ is $\|W\|_{1,\infty} = \sum_j \max_r |W_{jr}|$.

Here we assume that the weak classifier dictionary for each class is the same. The
final strong classifier is a weighted average of multiple weak classifiers, and the estimated
classification for a test datum $x$ is

$$F(x) = \arg\max_{r = 1, \dots, k} \sum_{j=1}^{n} w_{r,j} h_j(x).$$
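The notation above can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the function names (`strong_scores`, `predict`, `norm_l1`, `norm_l12`, `norm_l1inf`) and the toy numbers are hypothetical and do not come from the paper, classes are indexed from 0, and `W[j][r]` stores the weight of weak classifier $j$ for class $r$.

```python
import math

def strong_scores(h_row, W):
    """Return [F_1(x), ..., F_k(x)], where F_r(x) = sum_j W[j][r] * h_j(x).

    h_row[j] is the response h_j(x) of weak classifier j on the input x.
    """
    n, k = len(W), len(W[0])
    return [sum(h_row[j] * W[j][r] for j in range(n)) for r in range(k)]

def predict(h_row, W):
    """Predicted class index (0-based): argmax over r of F_r(x)."""
    scores = strong_scores(h_row, W)
    return max(range(len(scores)), key=lambda r: scores[r])

def norm_l1(W):
    """Sum of absolute values of all entries."""
    return sum(abs(v) for row in W for v in row)

def norm_l12(W):
    """Sum over rows (one row per weak classifier) of the row's l2 norm."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in W)

def norm_l1inf(W):
    """Sum over rows of the largest absolute entry in the row."""
    return sum(max(abs(v) for v in row) for row in W)

# Toy example (hypothetical numbers): n = 3 weak classifiers, k = 2 classes.
W = [[3.0, 4.0],
     [0.0, 0.0],
     [1.0, 0.0]]
h_row = [1, -1, -1]  # responses of the three weak classifiers on one input
```

In this toy matrix the second row is entirely zero, so the corresponding weak classifier is unused by every class; the mixed norms charge each row as a group, which is what makes such shared zero rows attractive.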

The remaining content is structured as follows. Section 2 presents the main algorithm
of our work. In particular, we begin by deriving our algorithm with the $\ell_1$ penalty for the
piece-wise linear hinge loss and exponential loss functions. Then we discuss group sparsity
and derive our algorithm with the new structural sparsity for both hinge loss and logistic
loss. We then present our experimental results and conclude.



2. A direct formulation for multi-class boosting 

In binary classification, the margin is defined as $yF(x)$ with $y \in \{-1, +1\}$. In the framework
of maximum margin learning, one tries to maximize the margin $yF(x)$ as much as possible.
A large margin implies the learned classifier confidently classifies the corresponding training
example. We show how this idea can be generalized to multi-class problems in this section.

2.1 MultiBoost with $\ell_1$-norm regularization (MultiBoost$^{\ell_1}$)

The hinge loss. Let us consider the hinge loss case, which is piecewise linear and therefore
makes it easy to derive our formulation. As we will show, both the primal and dual problems
are linear programs (LPs), which can be globally solved in polynomial time. The basic idea
is to learn classifiers by pairwise comparison. For a training example $(x, y)$, if we have a
perfect classification rule, then the following holds:

$$F_y(x) > F_r(x), \quad \text{for any } r \neq y.$$

In the large margin framework with the hinge loss, ideally

$$F_y(x) \ge 1 + F_r(x), \quad \text{for any } r \neq y, \tag{1}$$

should be satisfied. This means that the correct label is supposed to have a classification
confidence that is larger by at least a unit than any of the confidences for the other
predictions. This extension of "margin" to the multi-class case has been introduced in support
vector machines (Weston and Watkins, 1999; Elisseeff and Weston, 2001). As pointed out
in Weston and Watkins (1999), formulating multi-class problems as a pairwise ranking
problem in a single optimization can be more powerful than solving a set of one-vs-all
binary classifications. The argument is that we may generate a multi-class data set that can
be classified perfectly, but for which the training data cannot be separated with no error
by one-vs-all. Recent work in (Daniely et al., 2012) theoretically proved that the direct
approach to multi-class classification essentially contains the hypothesis classes of one-vs-all.
Also, because the estimation errors of these two methods are roughly the same, the direct
approach dominates one-vs-all in terms of achievable classification performance.

By introducing the indicator operator $\delta_{s,t}$ such that $\delta_{s,t} = 1$ if $s = t$ and $\delta_{s,t} = 0$
otherwise, the above equation can be simplified as

$$\delta_{r,y} + F_y(x) \ge 1 + F_r(x), \quad \forall r = 1, 2, \dots, k. \tag{2}$$



We generalize this idea to the entire training set and introduce slack variables $\xi$ to enable
soft-margin. The primal problem that we want to optimize can then be written as

$$\min_{W, \xi} \ \sum_{i=1}^{m} \xi_i + \nu \|W\|_1
\quad \text{s.t.: } \delta_{r,y_i} + H_{i:} w_{y_i} \ge 1 + H_{i:} w_r - \xi_i, \ \forall i, r; \ W \ge 0. \tag{3}$$
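For a fixed weight matrix $W$, the smallest slack that satisfies the constraints of (3) for example $i$ is $\xi_i = \max_r \, [\, 1 - \delta_{r,y_i} + H_{i:} w_r - H_{i:} w_{y_i} \,]$. The sketch below evaluates these slacks and the objective of (3); it is an illustration only, with hypothetical function names and toy data, 0-based class labels, and `W[j][r]` holding the weight of weak classifier $j$ for class $r$.

```python
def hinge_slacks(H, W, y):
    """Smallest slacks xi_i that satisfy the constraints of the primal (3).

    H[i][j] = h_j(x_i), W[j][r] = weight of weak classifier j for class r,
    y[i] = true class index (0-based) of example i.
    """
    n, k = len(W), len(W[0])
    slacks = []
    for h_row, yi in zip(H, y):
        F = [sum(h_row[j] * W[j][r] for j in range(n)) for r in range(k)]
        # The term with r == y_i equals 0, so each slack is non-negative.
        slacks.append(max(1.0 - (1.0 if r == yi else 0.0) + F[r] - F[yi]
                          for r in range(k)))
    return slacks

def primal_objective(H, W, y, nu):
    """Objective of (3): sum of slacks plus nu times the l1 norm of W."""
    l1 = sum(abs(v) for row in W for v in row)
    return sum(hinge_slacks(H, W, y)) + nu * l1

# Toy data (hypothetical): m = 3 examples, n = 2 weak classifiers, k = 2 classes.
H = [[1, -1], [-1, 1], [1, 1]]
y = [0, 1, 0]
W = [[1.0, 0.0], [0.0, 1.0]]
slacks = hinge_slacks(H, W, y)
objective = primal_objective(H, W, y, nu=0.5)
```

The first two examples meet the unit margin and need no slack; the third has equal scores for both classes and therefore incurs a slack of one.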



Here $\nu > 0$ is the regularization parameter. $\xi \ge 0$ always holds: if $\xi_i$ is negative for a
particular $x_i$, then the constraint in (3) that corresponds to the case $r = y_i$ will
be violated. In other words, the constraint corresponding to the case $r = y_i$ ensures the
non-negativeness of $\xi$. Note that we have one slack variable for each training example. It
is also possible to assign a slack variable to each constraint in (3). We derive its Lagrange
dual, similar to the case of LPBoost (Demiriz et al., 2002). The Lagrangian of problem (3) can
be written as

$$L = \sum_{i} \xi_i + \nu \sum_{j,r} W_{jr}
- \sum_{i,r} U_{ir} \left( \delta_{r,y_i} + H_{i:} w_{y_i} - 1 - H_{i:} w_r + \xi_i \right)
- \mathrm{Tr}(V^\top W),$$

with $U \ge 0$, $V \ge 0$. At optimum, the first derivative of the Lagrangian w.r.t. the primal
variables must vanish,

$$\frac{\partial L}{\partial \xi_i} = 0 \ \Longrightarrow \ \sum_{r} U_{ir} = 1, \ \forall i. \tag{4}$$



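Also,

$$\frac{\partial L}{\partial w_r} = 0 \ \Longrightarrow \
\nu \mathbf{1}^\top - \sum_{i:\, y_i = r} \Big( \underbrace{\textstyle\sum_{l} U_{il}}_{=1, \text{ due to (4)}} \Big) H_{i:}
+ \sum_{i} U_{ir} H_{i:} = V_{:r}^\top, \tag{5}$$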
Algorithm 1 MultiBoost$^{\ell_1}$ with the hinge loss.

Input:
  1) A set of examples $\{x_i, y_i\}$, $i = 1, \dots, m$;
  2) The maximum number of weak classifiers, $T$;
Output: A multi-class classifier $F(x) = \arg\max_r \sum_j w_{r,j} h_j(x)$;
Initialize:
  1) $t \leftarrow 0$;
  2) Initialize sample weights, $U_{ir} = 1/k$;
while $t < T$ do
  1) Find the weak classifier by solving the subproblem (8);
  2) If the stopping criterion has been met, exit the loop:
     if $\sum_{i=1}^{m} [\delta_{r,y_i} - U_{ir}] h(x_i) < \nu + \epsilon$ then break;
  3) Add the best weak classifier, $h_t(\cdot)$, into the primal problem;
  4) Solve the primal problem (3) using a primal-dual interior-point LP solver such as
     MOSEK (2012), such that the dual solution is also available;
  5) $t \leftarrow t + 1$;



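which leads to $\sum_i U_{ir} H_{i:} - \sum_{i:\, y_i = r} H_{i:} \ge -\nu \mathbf{1}^\top, \ \forall r$. So the Lagrange dual can be written
as (see footnote 1)

$$\min_{U} \ \sum_{r=1}^{k} \sum_{i=1}^{m} \delta_{r,y_i} U_{ir}
\quad \text{s.t.: } \sum_{i} (\delta_{r,y_i} - U_{ir}) H_{i:} \le \nu \mathbf{1}^\top, \ \forall r; \ \sum_{r} U_{ir} = 1, \ \forall i; \ U \ge 0. \tag{6}$$

Each row of the matrix $U$ is normalized. The first set of constraints can be infinitely many:

$$\sum_{i} (\delta_{r,y_i} - U_{ir}) h(x_i) \le \nu, \ \forall r, \text{ and } \forall h(\cdot) \in \mathcal{H}. \tag{7}$$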



We can now use column generation to solve the problem, similar to LPBoost (Demiriz
et al., 2002). The subproblem for generating weak classifiers is

$$h^*(\cdot) = \arg\max_{h(\cdot),\, r} \ \sum_{i=1}^{m} (\delta_{r,y_i} - U_{ir}) h(x_i). \tag{8}$$

The matrix $U \in \mathbb{R}^{m \times k}$ plays the role of measuring the importance of a training example.
Algorithm 1 can be used to implement our hinge loss based MultiBoost$^{\ell_1}$.
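The weak-classifier subproblem (8) and the stopping test of Algorithm 1 can be sketched as follows. This is a minimal illustration that assumes a finite pool of candidate weak classifiers whose outputs on the training set have been precomputed (for example, decision stumps); the names `select_weak_classifier` and `should_stop`, the tolerance value, and the toy data are hypothetical and not taken from the paper.

```python
def select_weak_classifier(candidates, y, U):
    """Solve subproblem (8) by exhaustive search over a finite candidate pool.

    candidates[j][i] = h_j(x_i) in {-1, +1}; y[i] = 0-based class label;
    U[i][r] = dual weight of example i for class r.
    Returns (best_j, best_r, best_score).
    """
    m, k = len(y), len(U[0])
    best = (None, None, float("-inf"))
    for j, outputs in enumerate(candidates):
        for r in range(k):
            score = sum(((1.0 if y[i] == r else 0.0) - U[i][r]) * outputs[i]
                        for i in range(m))
            if score > best[2]:
                best = (j, r, score)
    return best

def should_stop(best_score, nu, eps=1e-6):
    """Stopping test: no candidate violates the dual constraint (7) by more than eps."""
    return best_score < nu + eps

# Toy data (hypothetical): m = 4 examples, k = 2 classes, uniform weights U = 1/k.
y = [0, 0, 1, 1]
U = [[0.5, 0.5] for _ in y]
candidates = [[1, 1, -1, -1],   # agrees with class 0 vs. class 1
              [1, -1, 1, -1]]   # uninformative for these labels
best_j, best_r, best_score = select_weak_classifier(candidates, y, U)
```

With uniform weights the first candidate attains the largest score, so it is the column added to the restricted primal problem in the first iteration.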

1. Strictly speaking, this is one of the Lagrange duals of the original primal because some transformations 
from the standard form have been performed. 
The exponential loss. Now let us consider the exponential loss. In this
case, we may write the primal optimization problem as

$$\min_{W} \ \sum_{i=1}^{m} \sum_{r=1}^{k} \exp\left[ -(H_{i:} w_{y_i} - H_{i:} w_r) \right] + \nu' \|W\|_1, \quad \text{s.t.: } W \ge 0. \tag{9}$$

We define a set of margins associated with a training example as

$$\rho_{i,r} = H_{i:} w_{y_i} - H_{i:} w_r, \quad r = 1, \dots, k. \tag{10}$$

Clearly only when $\rho_{i,r} \ge 0$ will the training example $x_i$ be correctly classified. We consider
the logarithmic version of the original cost function, which does not change the problem
because $\log(\cdot)$ is strictly monotonically increasing. So we rewrite (9) as

$$\min_{W, \rho} \ \log\Big( \sum_{i=1}^{m} \sum_{r=1}^{k} \exp[-\rho_{i,r}] \Big) + \nu \|W\|_1
\quad \text{s.t.: } \rho_{i,r} = H_{i:} w_{y_i} - H_{i:} w_r, \ \forall i, \forall r; \ W \ge 0. \tag{11}$$

The dual problem can be easily derived:

$$\min_{U} \ \sum_{r=1}^{k} \sum_{i=1}^{m} U_{ir} \log U_{ir}
\quad \text{s.t.: } \sum_{i} \Big[ \delta_{r,y_i} \Big( \sum_{l=1}^{k} U_{il} \Big) - U_{ir} \Big] H_{i:} \le \nu \mathbf{1}^\top, \ \forall r; \ \sum_{i,r} U_{ir} = 1; \ U \ge 0. \tag{12}$$

We can see that the dual problem is a Shannon entropy maximization problem. The
objective function of the dual encourages the weights $U$ to be uniform. The KKT condition
gives the relationship between the optimal primal and dual variables:

$$U^*_{ir} = \frac{\exp(-\rho^*_{i,r})}{\sum_{i',r'} \exp(-\rho^*_{i',r'})}. \tag{13}$$

Unlike the case of the hinge loss, here $U$ is normalized as an entire matrix. Also, we
can solve the primal problem using a simple (quasi-)Newton method, which is much faster than
solving the dual problem using convex optimization solvers. Note that the scale of the primal
problem is usually smaller than that of the dual problem. After obtaining the primal variable, we
can use the KKT condition to get the dual variable. The subproblem that we need to solve
for generating weak classifiers also slightly differs from (8):

$$h^*(\cdot) = \arg\max_{h(\cdot),\, r} \ \sum_{i=1}^{m} \Big[ \delta_{r,y_i} \Big( \sum_{l=1}^{k} U_{il} \Big) - U_{ir} \Big] h(x_i). \tag{14}$$
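Recovering the dual weights from a primal solution via (10) and (13) can be sketched as follows. The code is an illustration only: the function names and toy data are hypothetical, class labels are 0-based, and the shift by the smallest margin is a standard numerical-stability device added here, not something stated in the paper.

```python
import math

def margins(H, W, y):
    """rho[i][r] = H_i: w_{y_i} - H_i: w_r, as in (10)."""
    n, k = len(W), len(W[0])
    rho = []
    for h_row, yi in zip(H, y):
        F = [sum(h_row[j] * W[j][r] for j in range(n)) for r in range(k)]
        rho.append([F[yi] - F[r] for r in range(k)])
    return rho

def exp_loss_weights(rho):
    """U[i][r] = exp(-rho[i][r]) / sum over all (i', r') of exp(-rho[i'][r']), as in (13)."""
    shift = min(v for row in rho for v in row)  # keeps every exponent <= 0
    e = [[math.exp(-(v - shift)) for v in row] for row in rho]
    total = sum(v for row in e for v in row)
    return [[v / total for v in row] for row in e]

# Toy data (hypothetical): m = 2 examples, n = 2 weak classifiers, k = 2 classes.
H = [[1, -1], [-1, 1]]
y = [0, 1]
W = [[1.0, 0.0], [0.0, 1.0]]
rho = margins(H, W, y)
U = exp_loss_weights(rho)
```

Entries with small margins receive larger weight, and the whole matrix sums to one, in line with the normalization constraint in (12).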

General convex loss. We generalize the presented idea to any smooth convex loss function
in this section. Suppose $\Theta(\cdot)$ is a smooth convex function defined in $\mathbb{R}$. For
classification problems, $\Theta(\cdot)$ is usually a convex surrogate of the non-convex zero-one loss. As in
the exponential loss case, we introduce a set of auxiliary variables that define the margin as
the pairwise difference of prediction scores. These auxiliary variables are the key to deriving the
Lagrange dual on which the fully-corrective boosting algorithms rely.
The optimization problem can be formulated as

$$\min_{W, \rho} \ \sum_{i=1}^{m} \sum_{r=1}^{k} \Theta(-\rho_{i,r}) + \nu \|W\|_1, \quad \text{s.t.: (10), and } W \ge 0. \tag{15}$$

The Lagrangian is

$$L = \sum_{i,r} \Theta(-\rho_{i,r}) - \mathrm{Tr}(V^\top W) + \sum_{i,r} U_{ir} (H_{i:} w_{y_i} - H_{i:} w_r) - \sum_{i,r} U_{ir} \rho_{i,r} + \nu \sum_{j,r} W_{jr}.$$

We can again write its Lagrange dual as

$$\min_{U} \ \sum_{r=1}^{k} \sum_{i=1}^{m} \Theta^*(-U_{ir})
\quad \text{s.t.: } \sum_{i} \Big[ \delta_{r,y_i} \Big( \sum_{l=1}^{k} U_{il} \Big) - U_{ir} \Big] H_{i:} \le \nu \mathbf{1}^\top, \ \forall r, \tag{16}$$

where $\Theta^*(\cdot)$ is the Fenchel dual function of $\Theta(\cdot)$ (Boyd and Vandenberghe, 2004). Note that
$\Theta^*(\cdot)$ is always convex even if the original loss function $\Theta(\cdot)$ is non-convex. The difference
is that the duality gap is not zero when $\Theta(\cdot)$ is non-convex. The KKT condition establishes
the connection between the dual variable $U$ and the primal variable at optimality:

$$U^*_{ir} = -\Theta'(\rho^*_{i,r}). \tag{17}$$

So we can actually solve the primal problem and then recover the dual solution from the
primal. From (17), we know that the weight $U$ is typically non-negative for classification
problems because the classification loss function $\Theta(\cdot)$ is monotonically decreasing and its
gradient is non-positive.

In the next section we formulate the multi-class boosting algorithm using mixed norm 
regularization. We maximize the same margin defined in the previous section. 

2.2 MultiBoost with group sparsity (MultiBoost$^{\text{group}}$)

The hinge loss with $\ell_{1,2}$-norm regularization. Given training samples, our goal is
to minimize the multi-class hinge loss with $\ell_{1,2}$ mixed-norm regularization. The primal
problem can be written as

$$\min_{W, \xi} \ \sum_{i=1}^{m} \xi_i + \nu \|W\|_{1,2},
\quad \text{s.t.: } \delta_{r,y_i} + H_{i:} w_{y_i} \ge 1 + H_{i:} w_r - \xi_i, \ \forall i, \forall r; \ W \ge 0; \ \xi \ge 0. \tag{18}$$

Here $\nu > 0$ is the regularization parameter. We rewrite (18) by introducing an auxiliary
variable $V$:

$$\min_{W, V, \xi} \ \sum_{i=1}^{m} \xi_i + \nu \|V\|_{1,2} \tag{19}$$
$$\text{s.t.: } \delta_{r,y_i} + H_{i:} w_{y_i} \ge 1 + H_{i:} w_r - \xi_i, \ \forall i, \forall r; \ V = W; \ W \ge 0; \ \xi \ge 0.$$

This auxiliary variable $V$ splits the regularization term from the classification loss, and plays
a critical role in deriving a meaningful dual problem. In fact, $\xi \ge 0$ is automatically
satisfied since the constraint corresponding to the case $r = y_i$ ensures the non-negativeness
of $\xi$. The Lagrangian can then be written as



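$$L = \sum_{i=1}^{m} \xi_i + \nu \|V\|_{1,2}
- \sum_{i,r} U_{ir} \left( \delta_{r,y_i} + H_{i:} w_{y_i} - 1 - H_{i:} w_r + \xi_i \right)
- \mathrm{Tr}\big( Q^\top (\nu V - \nu W) \big) - \mathrm{Tr}(P^\top W),$$

where $W$, $V$ and $\xi$ are primal variables and $U$, $P$ and $Q$ are dual variables (with $U \ge 0$
and $P \ge 0$). The sign of the multiplier term for the equality constraint $V = W$ is chosen so that
it agrees with (20), (21) and (22) below. At optimum, the first derivative of the Lagrangian
w.r.t. the primal variables $\xi$ must vanish, $\partial L / \partial \xi_i = 0 \Longrightarrow \sum_r U_{ir} = 1, \ \forall i$.
The first derivative w.r.t. each column of $W$ must also be zero:

$$\frac{\partial L}{\partial w_r} = 0 \ \Longrightarrow \
\sum_{i} U_{ir} H_{i:} - \sum_{i:\, y_i = r} H_{i:} = P_{:r}^\top - \nu Q_{:r}^\top. \tag{20}$$

The infimum over the primal variables $V$ can be expressed as

$$\inf_{V} L = \inf_{V} \ -\nu \langle Q, V \rangle + \nu \|V\|_{1,2}
= -\nu \sum_{j} \sup_{V_{j:}} \big[ \langle Q_{j:}, V_{j:} \rangle - \|V_{j:}\|_2 \big]
= \begin{cases} 0 & \text{if } \|Q_{j:}\|_2 \le 1, \ \forall j, \\ -\infty & \text{otherwise.} \end{cases} \tag{21}$$

Note that we use the fact that the convex conjugate of $\|V_{j:}\|_2$ is the indicator function of
the dual norm unit ball (Boyd and Vandenberghe, 2004). Hence the Lagrange dual can be
written as

$$\min_{U, Q} \ \sum_{i,r} U_{ir} \delta_{r,y_i} \tag{22}$$
$$\text{s.t.: } \sum_{i} (\delta_{r,y_i} - U_{ir}) H_{i:} \le \nu Q_{:r}^\top, \ \forall r; \ \sum_{r} U_{ir} = 1, \ \forall i; \ U \ge 0; \ \|Q_{j:}\|_2 \le 1, \ \forall j.$$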



Since there can be infinitely many constraints, we need to use column generation to solve
(22) (Demiriz et al., 2002). The subproblem for generating weak classifiers is

$$h^*(\cdot) = \arg\max_{h(\cdot),\, r} \ \sum_{i=1}^{m} (\delta_{r,y_i} - U_{ir}) h(x_i). \tag{23}$$



$h^*(\cdot)$ is the one that most violates the first constraint in the dual (22). The idea of column
generation is that, instead of solving the original problem with a prohibitively large number of
constraints, we consider a small subset of the entire variable set. The algorithm begins
by finding a variable that most violates the dual constraints, i.e., the solution to (23), which
corresponds to inserting a primal variable into (18) or (19). The process continues as long as
there exists at least one constraint that is violated for (22). The algorithm terminates when
we cannot find such a violated constraint. As in AdaBoost, the matrix $U \in \mathbb{R}^{m \times k}$ plays
the role of measuring the importance of the training samples. The weak classifier which
maximizes (23) is selected in each iteration.
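The hinge loss with $\ell_{1,\infty}$-norm regularization. Similarly, the $\ell_{1,\infty}$-norm regularized
primal can be written as

$$\min_{W, V, \xi} \ \sum_{i=1}^{m} \xi_i + \nu \|V\|_{1,\infty} \tag{24}$$
$$\text{s.t.: } \delta_{r,y_i} + H_{i:} w_{y_i} \ge 1 + H_{i:} w_r - \xi_i, \ \forall i, \forall r; \ V = W; \ W \ge 0; \ \xi \ge 0.$$

The Lagrangian of (24) can be written as

$$L = \sum_{i=1}^{m} \xi_i + \nu \|V\|_{1,\infty}
- \sum_{i,r} U_{ir} \left( \delta_{r,y_i} + H_{i:} w_{y_i} - 1 - H_{i:} w_r + \xi_i \right)
- \mathrm{Tr}\big( Q^\top (\nu V - \nu W) \big) - \mathrm{Tr}(P^\top W),$$

with $U \ge 0$ and $P \ge 0$. For the $\ell_{1,\infty}$-norm, the infimum over the primal variables $V$ can be
expressed as

$$\inf_{V} L = \inf_{V} \ -\nu \langle Q, V \rangle + \nu \|V\|_{1,\infty}
= -\nu \sum_{j} \sup_{V_{j:}} \big[ \langle Q_{j:}, V_{j:} \rangle - \|V_{j:}\|_\infty \big]
= \begin{cases} 0 & \text{if } \|Q_{j:}\|_1 \le 1, \ \forall j, \\ -\infty & \text{otherwise.} \end{cases} \tag{25}$$

Here we make use of the fact that

$$f^*(y) = \sup_{x} \big( y^\top x - \|x\| \big) = \begin{cases} 0 & \text{if } \|y\|_* \le 1, \\ \infty & \text{otherwise,} \end{cases} \tag{26}$$

where $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$. Hence we can derive its corresponding dual as

$$\min_{U, Q} \ \sum_{i,r} U_{ir} \delta_{r,y_i} \tag{27}$$
$$\text{s.t.: } \sum_{i} (\delta_{r,y_i} - U_{ir}) H_{i:} \le \nu Q_{:r}^\top, \ \forall r; \ \sum_{r} U_{ir} = 1, \ \forall i; \ U \ge 0; \ \|Q_{j:}\|_1 \le 1, \ \forall j.$$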

From the dual problem we see that the only difference between the $\ell_{1,2}$-norm and the $\ell_{1,\infty}$-norm is in the norm of the last constraint. This is not surprising since the $\ell_p$ norm in the primal corresponds to the $\ell_q$ norm in the dual, with $1/p + 1/q = 1$.

2. We note here that the $\ell_p$ norm in the primal corresponds to the $\ell_q$ norm in the dual with $1/p + 1/q = 1$. For example, the Euclidean norm $\|\cdot\|_2$ is dual to itself and the $\ell_1$-norm $\|\cdot\|_1$ is dual to the $\ell_\infty$-norm $\|\cdot\|_\infty$.
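To make the mixed norms and the primal/dual exponent pairing concrete, here is a minimal sketch. It assumes a matrix stored as a list of rows, with the mixed norm taken as the sum over rows of the row-wise $\ell_p$ norm, as in $\sum_j \|V_{j:}\|_p$; the helper names `mixed_norm` and `dual_exponent` are illustrative only.

```python
import math


def mixed_norm(V, p):
    """l_{1,p} mixed norm: the sum over rows of the l_p norm of each row."""
    total = 0.0
    for row in V:
        if p == math.inf:
            total += max(abs(v) for v in row)
        else:
            total += sum(abs(v) ** p for v in row) ** (1.0 / p)
    return total


def dual_exponent(p):
    """Return q such that 1/p + 1/q = 1 (with the usual conventions for 1 and inf)."""
    if p == 1:
        return math.inf
    if p == math.inf:
        return 1.0
    return p / (p - 1.0)


norm_12 = mixed_norm([[3, 4], [0, 0]], 2)           # only the first row is active
norm_1inf = mixed_norm([[3, -4], [1, 2]], math.inf)
```

A row that is entirely zero contributes nothing to either norm, which is why these regularizers encourage whole rows (weak classifiers shared across classes) to be dropped together.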
The logistic loss with $\ell_{1,2}$-norm and $\ell_{1,\infty}$-norm regularization. In this section, we consider the logistic loss with a mixed-norm regularization. The learning problem for the logistic loss in an $\ell_{1,2}$ regularization framework can be expressed as

$$\min_{W,V,\rho} \; \frac{1}{mk} \sum_{i=1}^{m} \sum_{r=1}^{k} \log\left( 1 + \exp(-\rho_{ir}) \right) + \nu \|V\|_{1,2} \tag{28}$$
$$\text{s.t.: } \rho_{ir} = H_{i:} w_{y_i} - H_{i:} w_r, \; \forall i, \forall r; \quad V = W; \quad W \ge 0.$$



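The Lagrangian of (28) can be written as

$$L = \frac{1}{mk} \sum_{i=1}^{m} \sum_{r=1}^{k} \log\left( 1 + \exp(-\rho_{ir}) \right) + \nu \|V\|_{1,2} - \sum_{i,r} U_{ir} \left( \rho_{ir} - H_{i:} w_{y_i} + H_{i:} w_r \right) - \mathrm{Tr}\left( Q^\top (\nu W - \nu V) \right) - \mathrm{Tr}(P^\top W), \tag{29}$$

with $U \ge 0$ and $P \ge 0$. At the optimum, the first derivative of the Lagrangian w.r.t. each $w_r$ must be zero,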



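$$\frac{\partial L}{\partial w_r} = 0 \;\rightarrow\; \sum_{i:\, r = y_i} \Big( \sum_l U_{il} \Big) H_{i:} - \sum_i U_{ir} H_{i:} = P_{:r}^\top - \nu Q_{:r}^\top \tag{30}$$
$$\rightarrow\; \sum_i \Big[ \delta_{r,y_i} \Big( \sum_l U_{il} \Big) - U_{ir} \Big] H_{i:} \ge -\nu Q_{:r}^\top,$$

for $\forall r$. Taking the infimum over the primal variable $\rho_{ir}$,

$$\frac{\partial L}{\partial \rho_{ir}} = 0 \;\rightarrow\; \rho_{ir}^* = \log \frac{-mkU_{ir}^* - 1}{mkU_{ir}^*}, \tag{31}$$

and

$$\inf_{\rho_{ir}} L = -\frac{1}{mk} \sum_{i,r} \Big[ (1 + mkU_{ir}) \log(1 + mkU_{ir}) - mkU_{ir} \log(-mkU_{ir}) \Big]. \tag{32}$$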



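By reversing the sign of $U$, the Lagrange dual can be written as

$$\max_{U,Q} \; -\frac{1}{mk} \sum_{i=1}^{m} \sum_{r=1}^{k} \Big[ mkU_{ir} \log(mkU_{ir}) + (1 - mkU_{ir}) \log(1 - mkU_{ir}) \Big] \tag{33}$$
$$\text{s.t.: } \sum_i \Big[ \delta_{r,y_i} \Big( \sum_l U_{il} \Big) - U_{ir} \Big] H_{i:} \le \nu Q_{:r}^\top, \; \forall r; \quad \|Q_{j:}\|_2 \le 1, \; \forall j.$$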

Through the KKT optimality conditions, the gradient of the Lagrangian over the primal variables $\rho$ and the dual variables $U$ must vanish at the optimum. The solutions of (28) and (33) coincide since both problems are feasible and satisfy Slater's condition. One can find the solution by solving either problem. The relationship between the optimal values of $\rho$ and $U$ can be expressed as

$$U_{ir}^* = \frac{\exp(-\rho_{ir}^*)}{mk \left( 1 + \exp(-\rho_{ir}^*) \right)}. \tag{34}$$
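The link between margins and sample weights in (34) can be checked numerically. The sketch below assumes scalar inputs and uses hypothetical helper names. It verifies that, at the weight given by (34), one summand of the primal loss plus $U_{ir}\rho_{ir}$ equals the corresponding summand of the dual objective (33), which is the per-term identity obtained when the infimum over $\rho_{ir}$ is taken.

```python
import math


def sample_weight(rho, m, k):
    """Eq. (34): dual variable (sample weight) for a given margin rho."""
    return math.exp(-rho) / (m * k * (1.0 + math.exp(-rho)))


def primal_term(rho, m, k):
    """One summand of the logistic loss in (28)."""
    return math.log(1.0 + math.exp(-rho)) / (m * k)


def dual_term(U, m, k):
    """One summand of the entropy-like dual objective in (33)."""
    a = m * k * U
    return -(a * math.log(a) + (1.0 - a) * math.log(1.0 - a)) / (m * k)


m, k = 5, 3
gaps = []
for rho in (-2.0, 0.5, 3.0):
    U = sample_weight(rho, m, k)
    gaps.append(abs(primal_term(rho, m, k) + U * rho - dual_term(U, m, k)))
```

Small margins produce large weights and large margins produce small ones, which matches the usual boosting intuition that hard examples are emphasized.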
As is the case for the hinge loss, the dual of the $\ell_{1,\infty}$-norm regularized logistic loss can be written as

$$\max_{U,Q} \; -\frac{1}{mk} \sum_{i=1}^{m} \sum_{r=1}^{k} \Big[ mkU_{ir} \log(mkU_{ir}) + (1 - mkU_{ir}) \log(1 - mkU_{ir}) \Big] \tag{35}$$
$$\text{s.t.: } \sum_i \Big[ \delta_{r,y_i} \Big( \sum_l U_{il} \Big) - U_{ir} \Big] H_{i:} \le \nu Q_{:r}^\top, \; \forall r; \quad \|Q_{j:}\|_1 \le 1, \; \forall j.$$



General convex loss with arbitrary $\ell_{1,p}$-norm regularization. In this section, we generalize our idea to any convex loss function with any mixed-norm regularizer. As before, we define $\ell(\cdot)$ as a smooth convex function and $\Omega(\cdot)$ as any well-established regularization function³. We define the margin as the pairwise difference of prediction scores. The general mixed-norm regularized optimization problem that we want to solve is,

$$\min_{W,V,\rho} \; \sum_{i=1}^{m} \sum_{r=1}^{k} \ell(\rho_{ir}) + \nu \sum_j \Omega(V_{j:}) \tag{36}$$
$$\text{s.t.: } \rho_{ir} = H_{i:} w_{y_i} - H_{i:} w_r, \; \forall i, \forall r; \quad V = W; \quad W \ge 0.$$



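The Lagrangian of (36) is

$$L = \sum_{i=1}^{m} \sum_{r=1}^{k} \ell(\rho_{ir}) + \nu \sum_j \Omega(V_{j:}) - \sum_{i,r} U_{ir} \left( \rho_{ir} - H_{i:} w_{y_i} + H_{i:} w_r \right) - \mathrm{Tr}\left( Q^\top (\nu W - \nu V) \right) - \mathrm{Tr}(P^\top W). \tag{37}$$

Following our derivation for the multi-class logistic loss, the Lagrange dual can be written as,

$$\max_{U,Q} \; -\sum_{i=1}^{m} \sum_{r=1}^{k} \ell^*(-U_{ir}) \tag{38}$$
$$\text{s.t.: } \sum_i \Big[ \delta_{r,y_i} \Big( \sum_l U_{il} \Big) - U_{ir} \Big] H_{i:} \le \nu Q_{:r}^\top, \; \forall r; \quad \Omega^*(Q_{j:}) \le 1, \; \forall j,$$

where $\ell^*(\cdot)$ is the Fenchel dual function of $\ell(\cdot)$ and $\Omega^*(\cdot)$ is the Fenchel conjugate of $\Omega(\cdot)$.

Through the KKT condition, the relationship between the dual variable $U$ and the primal variable $\rho$,

$$U_{ir} = -\nabla \ell(\rho_{ir}), \tag{39}$$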

holds at optimality. It is important to note here the difference between MultiBoost$^{\ell_1}$ (having an $\ell_1$ penalty) and MultiBoost$^{\text{group}}$ (having a mixed-norm penalty). Although both dual variables, $U$, have the same expression, i.e., each dual variable is defined as the negative gradient of the loss at $\rho_{ir}$, the solutions for the primal variables, $W$, are different. MultiBoost$^{\ell_1}$ does not enforce group sparsity and is unable to exploit the existence of structural features. The details of our boosting algorithm are given in Algorithm 2.

3. Here we assume a non-overlapping group structure. This assumption is always valid since $W = [w_1, w_2, \ldots, w_k]$ and $\bigcap_{r=1}^{k} w_r = \emptyset$.
Algorithm 2: MultiBoost with shared weak classifiers via group sparsity.

Input:
  1) A set of examples $\{x_i, y_i\}$, $i = 1, \ldots, m$;
  2) The maximum number of weak classifiers, $T$.

Output: A multi-class classifier $F(x) = \operatorname{argmax}_r \sum_j w_{r,j} h_j(x)$.

Initialize:
  1) $t \leftarrow 0$;
  2) Initialize sample weights, $U_{ir} = 1/(mk)$.

while $t < T$ do
  1) Train a weak learner, $h_t(\cdot)$:
       hinge loss: $h_t(\cdot) = \operatorname{argmax}_{h(\cdot) \in \mathcal{H},\, r} \sum_{i=1}^{m} \left[ \delta_{r,y_i} - U_{ir} \right] h(x_i)$;
       logistic loss: $h_t(\cdot) = \operatorname{argmax}_{h(\cdot) \in \mathcal{H},\, r} \sum_{i=1}^{m} \left[ \delta_{r,y_i} \left( \sum_l U_{il} \right) - U_{ir} \right] h(x_i)$;
  2) If the stopping criterion has been met, exit the loop:
       hinge loss: if $\sum_{i=1}^{m} \left[ \delta_{r,y_i} - U_{ir} \right] h_t(x_i) < \nu + \epsilon$ then break;
       logistic loss: if $\sum_{i=1}^{m} \left[ \delta_{r,y_i} \left( \sum_l U_{il} \right) - U_{ir} \right] h_t(x_i) < \nu + \epsilon$ then break;
  3) Add the best weak learner, $h_t(\cdot)$, into the current set;
  4) Solve either the primal or the dual problem (we solve the dual (22)) for the hinge loss case; or solve the primal problem (28) using ADMM for the logistic loss case;
  5) Update sample weights (dual variables);
  6) $t \leftarrow t + 1$.
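The output rule of Algorithm 2, $F(x) = \operatorname{argmax}_r \sum_j w_{r,j} h_j(x)$, is simple enough to sketch directly. The layout `W[r][j]` (class first, weak learner second) and the function name `predict` are assumptions made for this illustration.

```python
def predict(W, weak_outputs):
    """Return the class r maximizing sum_j W[r][j] * h_j(x).

    W[r][j] is the coefficient of weak learner j for class r (assumed layout),
    and weak_outputs[j] is h_j(x) for the sample being classified.
    """
    scores = [sum(w * h for w, h in zip(row, weak_outputs)) for row in W]
    return max(range(len(scores)), key=scores.__getitem__)


# Three classes sharing two weak learners.
W = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
label_a = predict(W, [1, -1])
label_b = predict(W, [-1, 1])
```

Because every class scores the same weak-learner outputs, a weak learner with a zero coefficient for all classes never needs to be evaluated at test time, which is the practical benefit of group sparsity.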



Theorem 1 (Convergence property). Both the $\ell_1$-norm and the $\ell_{1,p}$-norm regularized boosting algorithms are guaranteed to converge to an optimum for any convex loss function, provided that both algorithms make progress at each boosting iteration. In other words, as long as the objective value decreases, both algorithms optimize (36) globally to a desired accuracy.



Proof. Here we consider MultiBoost$^{\ell_1}$. The proof for MultiBoost$^{\text{group}}$ would follow the same discussion. Our proof relies on the fact that the $\ell_1$ regularizer forces the set of possible solutions to be sparse and that each column generation iteration guarantees a smaller objective value. We first assume that the current solution is a finite subset of weak learners, $\{h_j(\cdot)\}_{j=1}^{n-1}$. If we add a weak learner, $h_n(\cdot)$, that is not in the current subset, and the corresponding coefficient is $w_{r,n} = 0, \forall r$, the solution must remain unchanged. We can simply conclude that the current set of weak learners, $\{h_j(\cdot)\}_{j=1}^{n-1}$, and their coefficients, $w_r, \forall r$, are already at the optimal solution.
Next, we consider the case when the optimality condition is violated. We need to show that we can find a weak learner, $h_n(\cdot)$, which is not in the current set and for which $\exists r: w_{r,n} > 0$. Let us assume that $h_n(\cdot)$ is the base learner found by solving Step 1 in Algorithm 1 and that the stopping criterion (Step 2) has not been met. Hence $\exists r: \sum_{i=1}^{m} \left[ \delta_{r,y_i} - U_{ir} \right] h_n(x_i) > \nu$. Suppose that, after the weak learner $h_n(\cdot)$ is added into the primal problem, the primal solution remains unchanged, that is, $w_{r,n} = 0, \forall r$. Based on the optimality condition:

$$\inf_{w_r} L' = \inf_{w_r} \Big( \nu \mathbf{1}^\top - \sum_{i=1}^{m} \left[ \delta_{r,y_i} - U_{ir} \right] H_{i:} - V_{r:} \Big) w_r.$$

At the optimum, the first derivative of $L'$ w.r.t. the primal variables must vanish. But then $\exists r: V_{r,n} = \nu - \sum_{i=1}^{m} \left[ \delta_{r,y_i} - U_{ir} \right] h_n(x_i) < 0$. This contradicts the fact that the Lagrange multiplier, $V_{r,n}$, must be greater than or equal to zero.

We can conclude that after the base learner $h_n(\cdot)$ is added into the primal problem, $\exists r: w_{r,n} > 0$. Since one more primal variable is added into the problem, the objective value of the primal problem must decrease. A decrease in the objective value guarantees that the algorithm makes progress at each iteration. Since all optimization problems are convex, every local optimum is also a global optimum. Therefore the proposed column generation based boosting is guaranteed to converge to the global optimal solution.



2.3 Implementation 



Note that the dual problem of the hinge loss, (22), is a conic quadratic optimization problem involving several linear constraints and quadratic cones. We use the Mosek optimization solver to solve (22), which provides solutions for both the primal and dual problems simultaneously using the interior-point method. For the logistic loss formulation, the primal problem (28) has $nk$ variables and $mk$ simple constraints. The dual problem has $mk$ variables⁴ and $nk$ constraints. In boosting, we often have more training samples than final weak classifiers ($m \gg n$), so the primal problem is the smaller of the two. However, the $\ell_{1,2}$-norm is not differentiable everywhere, and thus to solve (28) we apply the ADMM method (Boyd et al., 2011). ADMM decouples the regularization term from the logistic loss by introducing additional auxiliary variables. The algorithm then solves (28) by using an alternating minimization approach. ADMM formulates the original problem as the following,

$$\min_{W,Z} \; f(W) + g(Z) \quad \text{s.t.: } W = Z. \tag{40}$$



Here $f(W)$ is any convex loss function (36) and $g(Z) = \nu\,\Omega(Z)$ is any regularization function. As in the method of multipliers, we form the augmented Lagrangian,

$$L_\theta = f(W) + g(Z) + \langle U, W - Z \rangle + \frac{\theta}{2} \|W - Z\|_2^2. \tag{41}$$



4. Here we ignore the equality constraints since they can be put back into the original cost function. 
Here $\theta$ is the augmented Lagrangian parameter ($\theta > 0$). The method of multipliers for (40) has the form,

$$(W^{s+1}, Z^{s+1}) = \operatorname*{argmin}_{W,Z} \; L_\theta(W, Z, U^s) \tag{42}$$
$$U^{s+1} = U^s + \theta \left( W^{s+1} - Z^{s+1} \right). \tag{43}$$

Here the Lagrangian is minimized jointly with respect to both the $W$ and $Z$ variables. Since it is expensive to solve the joint minimization in (42), both primal variables ($W$ and $Z$) are updated in an alternating fashion. This alternating update scheme is known as ADMM. ADMM consists of the following iterations,

$$W^{s+1} = \operatorname*{argmin}_{W} \; L_\theta(W, Z^s, U^s) \tag{44}$$
$$Z^{s+1} = \operatorname*{argmin}_{Z} \; L_\theta(W^{s+1}, Z, U^s) \tag{45}$$
$$U^{s+1} = U^s + \theta \left( W^{s+1} - Z^{s+1} \right). \tag{46}$$
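The alternating scheme can be illustrated on a deliberately tiny problem that is not taken from the paper: $f(w) = \frac{1}{2}(w - a)^2$ and $g(z) = \nu |z|$ with scalar variables. Both updates then have closed forms, and the known minimizer is the soft-thresholded value of $a$. The function names and the default parameters are assumptions for this sketch.

```python
def soft(x, kappa):
    """Scalar soft thresholding."""
    if x > kappa:
        return x - kappa
    if x < -kappa:
        return x + kappa
    return 0.0


def admm_scalar_lasso(a, nu, theta=1.0, iters=200):
    """ADMM for min 0.5*(w - a)^2 + nu*|z| s.t. w = z (toy stand-in for (40))."""
    w = z = u = 0.0
    for _ in range(iters):
        # w-update: minimize 0.5*(w - a)^2 + u*w + (theta/2)*(w - z)^2
        w = (a - u + theta * z) / (1.0 + theta)
        # z-update: minimize nu*|z| - u*z + (theta/2)*(w - z)^2
        z = soft(w + u / theta, nu / theta)
        # dual update, as in the method of multipliers
        u = u + theta * (w - z)
    return w, z, u


w1, z1, u1 = admm_scalar_lasso(a=3.0, nu=1.0)
w2, z2, u2 = admm_scalar_lasso(a=0.5, nu=1.0)
```

In the first call the iterates approach $a - \nu = 2$; in the second, $|a| \le \nu$ and the regularizer drives the solution to zero.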

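As an example, we regularize the above logistic loss with a mixed-norm $\ell_{1,2}$ regularizer. We can rewrite (44) and (45) as,

$$W^{s+1} = \operatorname*{argmin}_{W} \; \frac{1}{mk} \sum_{i=1}^{m} \sum_{r=1}^{k} \log\left( 1 + \exp(-\rho_{ir}) \right) + (U^s)^\top W + \frac{\theta}{2} \|W - Z^s\|_2^2 \tag{47}$$
$$Z^{s+1} = \operatorname*{argmin}_{Z} \; \nu \|Z\|_{1,2} - (U^s)^\top Z + \frac{\theta}{2} \|W^{s+1} - Z\|_2^2. \tag{48}$$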



Here $\rho_{ir} = H_{i:} w_{y_i} - H_{i:} w_r$. Since (47) is now smooth and differentiable everywhere, a quasi-Newton method such as L-BFGS-B can be used to efficiently solve (47). For (48), a closed-form solution exists and it can be computed through sub-differential calculus (Boyd et al., 2011). The solution is known as block soft thresholding,

$$Z_{j:}^{s+1} = \mathcal{S}_{\nu/\theta}\Big( W_{j:}^{s+1} + \frac{1}{\theta} U_{j:}^{s} \Big), \; \forall j, \tag{49}$$

where $\mathcal{S}$ is a vector soft thresholding operator defined as

$$\mathcal{S}_\kappa(x) = \left( 1 - \kappa / \|x\|_2 \right)_+ x, \tag{50}$$

with $\mathcal{S}_\kappa(0) = 0$. A brief summary of ADMM is provided in Algorithm 3.
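The block soft thresholding step in (49) and (50) can be sketched as follows. The matrices are assumed to be stored as lists of rows, and the function names are illustrative only.

```python
import math


def block_soft_threshold(x, kappa):
    """Vector soft thresholding, eq. (50): shrink the whole vector towards zero."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0.0:
        return [0.0 for _ in x]
    scale = max(0.0, 1.0 - kappa / norm)
    return [scale * v for v in x]


def z_update(W, U, nu, theta):
    """Row-wise Z-update, eq. (49): Z_j = S_{nu/theta}(W_j + U_j / theta)."""
    return [
        block_soft_threshold([w + u / theta for w, u in zip(w_row, u_row)], nu / theta)
        for w_row, u_row in zip(W, U)
    ]


Z = z_update([[3.0, 4.0], [0.1, 0.0]], [[0.0, 0.0], [0.0, 0.0]], nu=1.0, theta=1.0)
```

The first row has norm 5 and is scaled by 0.8, while the second row has norm below the threshold and is set to zero as a whole; this row-level zeroing is what produces shared, group-sparse weak classifiers.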

Distributed optimization via ADMM. We describe here how to exploit distributed computing in ADMM to speed up the training of our proposed approach. In order to solve the problem in a distributed fashion, we first separate the loss function across $q_{\max}$ blocks of data. We redefine our problem as,

$$\min \; \sum_{q=1}^{q_{\max}} \ell_q(W_q) + \nu \cdot \Omega(Z) \quad \text{s.t.: } W_q - Z = 0, \; q = 1, \ldots, q_{\max}, \tag{51}$$
Algorithm 3: ADMM for solving (28).

Input:
  1) Outputs of weak classifiers, $H$;
  2) Augmented Lagrangian parameter, $\theta$;
  3) The maximum number of iterations, $s_{\max}$.

Output: An optimal $W^*$.

Initialize:
  1) $s \leftarrow 0$;
  2) $W^0$, $Z^0$, $U^0$.

1 repeat
2   $W^{s+1} = \operatorname{argmin}_W \frac{1}{mk} \sum_{i,r} \log\left( 1 + \exp(-\rho_{ir}) \right) + (U^s)^\top W + \frac{\theta}{2} \|W - Z^s\|_2^2$;
3   Update $Z^{s+1}$ using (49);
4   Update $U^{s+1}$ using (43);
5   $s \leftarrow s + 1$;
6   if $s > s_{\max}$ then
7     break;
8 until convergence;



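where $\ell_q$ refers to the loss function for the $q$-th block of data. Similar to the previous section, ADMM considers the following iterations,

$$W_q^{s+1} = \operatorname*{argmin}_{W_q} \; L_\theta(W_q, Z^s, U_q^s), \; \forall q; \tag{52}$$
$$Z^{s+1} = \operatorname*{argmin}_{Z} \; L_\theta(W_1^{s+1}, \ldots, W_{q_{\max}}^{s+1}, Z, U^s); \tag{53}$$
$$U_q^{s+1} = U_q^s + \theta \left( W_q^{s+1} - Z^{s+1} \right), \; \forall q = 1, \ldots, q_{\max}, \tag{54}$$

where $\theta$ is the augmented Lagrangian parameter ($\theta > 0$). The resulting ADMM algorithm for (52) and (53) is

$$W_q^{s+1} = \operatorname*{argmin}_{W_q} \; \ell_q(W_q) + (U_q^s)^\top W_q + \frac{\theta}{2} \|W_q - Z^s\|_2^2, \; \forall q, \tag{55}$$
$$Z^{s+1} = \operatorname*{argmin}_{Z} \; \nu \cdot \Omega(Z) + \sum_{q=1}^{q_{\max}} \Big( -(U_q^s)^\top Z + \frac{\theta}{2} \|W_q^{s+1} - Z\|_2^2 \Big), \tag{56}$$
$$U_q^{s+1} = U_q^s + \theta \left( W_q^{s+1} - Z^{s+1} \right), \; \forall q, \tag{57}$$

where $\bar{W}^{s+1} = \frac{1}{q_{\max}} \sum_{q=1}^{q_{\max}} W_q^{s+1}$ and $\bar{U}^{s} = \frac{1}{q_{\max}} \sum_{q=1}^{q_{\max}} U_q^{s}$. For MultiBoost$^{\ell_1}$, the $Z$-update is a soft threshold operation, i.e.,

$$\begin{aligned}
Z^{s+1} &= \operatorname*{argmin}_{Z} \; \nu \cdot \Omega(Z) + \sum_{q=1}^{q_{\max}} \Big( -(U_q^s)^\top Z + \frac{\theta}{2} \|W_q^{s+1} - Z\|_2^2 \Big) \\
&= \operatorname*{argmin}_{Z} \; \nu \|Z\|_1 + (q_{\max}\theta/2) \|Z - \bar{W}^{s+1} - (1/\theta) \bar{U}^s\|_2^2 \\
&= \mathcal{S}_{\nu/(\theta q_{\max})}\big( \bar{W}_j^{s+1} + (1/\theta) \bar{U}_j^s \big), \; \forall j.
\end{aligned} \tag{58}$$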

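The soft thresholding operator $\mathcal{S}$ applied to a scalar is defined as

$$\mathcal{S}_\kappa(x) = \left( 1 - \kappa / |x| \right)_+ x, \tag{59}$$

for $x \ne 0$. For MultiBoost$^{\text{group}}$, e.g., with the $\ell_{1,2}$-norm regularizer, a closed-form solution exists for (56) and it can be computed as

$$\begin{aligned}
Z^{s+1} &= \operatorname*{argmin}_{Z} \; \nu \cdot \Omega(Z) + \sum_{q=1}^{q_{\max}} \Big( -(U_q^s)^\top Z + \frac{\theta}{2} \|W_q^{s+1} - Z\|_2^2 \Big) \\
&= \operatorname*{argmin}_{Z} \; \nu \|Z\|_{1,2} + (q_{\max}\theta/2) \|Z - \bar{W}^{s+1} - (1/\theta) \bar{U}^s\|_2^2 \\
&= \mathcal{S}_{\nu/(\theta q_{\max})}\big( \bar{W}_{j:}^{s+1} + (1/\theta) \bar{U}_{j:}^s \big), \; \forall j.
\end{aligned} \tag{60}$$

In this case, note that $\mathcal{S}$ is the vector thresholding operator defined in (50).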



Here we assume that $\sum_{q=1}^{q_{\max}} m_q = m$, i.e., the sum of the number of samples in each block is equal to the total number of samples. The first step, (55), can be carried out independently in parallel for each block of data. In other words, we distribute (55) to each thread or processor. The second step, (58) or (60), gathers the variables computed in (55) to form the average. After the final step, (57), the value of $U_q^{s+1}$ is then distributed to the subsystems.
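The gather step can be sketched for a single scalar coordinate under the $\ell_1$ regularizer. The closed form averages the per-block variables and soft-thresholds with threshold $\nu/(\theta q_{\max})$, as in (58). The helper `consensus_objective` is included only so that the closed form can be compared against the objective of (56); all names are hypothetical.

```python
def soft(x, kappa):
    """Scalar soft thresholding."""
    if x > kappa:
        return x - kappa
    if x < -kappa:
        return x + kappa
    return 0.0


def consensus_z_update(w_blocks, u_blocks, nu, theta):
    """Closed-form Z-update of (58) for one coordinate shared by all blocks."""
    q = len(w_blocks)
    w_bar = sum(w_blocks) / q
    u_bar = sum(u_blocks) / q
    return soft(w_bar + u_bar / theta, nu / (theta * q))


def consensus_objective(z, w_blocks, u_blocks, nu, theta):
    """Objective of (56) for one coordinate with Omega(z) = |z|."""
    return nu * abs(z) + sum(
        -u * z + 0.5 * theta * (w - z) ** 2 for w, u in zip(w_blocks, u_blocks)
    )


w_blocks = [1.0, 2.0, 3.0]
u_blocks = [0.3, -0.1, 0.4]
z_star = consensus_z_update(w_blocks, u_blocks, nu=0.6, theta=2.0)
f_star = consensus_objective(z_star, w_blocks, u_blocks, 0.6, 2.0)
```

Only the averages of the block variables are needed for the gather step, so each worker communicates one matrix per iteration regardless of how many samples it holds.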



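For both hinge loss and logistic regression, we can rewrite (55) as,

$$W_q^{s+1} = \operatorname*{argmin}_{W_q} \; \frac{1}{m_q} \sum_{i} \xi_i + (U_q^s)^\top W_q + \frac{\theta}{2} \|W_q - Z^s\|_2^2 \tag{61}$$
$$W_q^{s+1} = \operatorname*{argmin}_{W_q} \; \frac{1}{m_q k} \sum_{i} \sum_{r} \log\left( 1 + \exp(-\rho_{ir}) \right) + (U_q^s)^\top W_q + \frac{\theta}{2} \|W_q - Z^s\|_2^2 \tag{62}$$

respectively.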

2.4 Faster training of multi-class boosting 

Although we have combined ADMM with L-BFGS-B for faster training with the multi-class logistic loss, the resulting algorithm is still computationally expensive to train. The drawback of (28) is that the formulation cannot be separated across classes for faster training. Since real-world data often consists of a large number of samples and classes, the training procedure can be very slow.

In order to improve the training efficiency of the classifier, we thus propose here another variation of multi-class boosting based on the logistic loss. This variation is achieved through a simplification of the form of $\rho_{ir}$ in (28) to $\rho_{ir} = y_{ir} H_{i:} w_r$, where $y_{ir} = 1$ if $y_i = r$ and $y_{ir} = -1$ otherwise. Note that this formulation was originally introduced in Chapelle and Keerthi (2008) for multi-class as well as multi-label support vector machine (SVM) learning and proved to be effective. To our knowledge, this formulation of the multi-class loss function has not been applied to boosting. Here we extend it to multi-class boosting. The fast training (fast) formulation is:

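$$\min_{W,\rho} \; \frac{1}{mk} \sum_{i=1}^{m} \sum_{r=1}^{k} \log\left( 1 + \exp(-\rho_{ir}) \right) + \nu \|W\|_{1,2} \quad \text{s.t.: } \rho_{ir} = y_{ir} H_{i:} w_r, \; \forall i, \forall r; \quad W \ge 0. \tag{63}$$

The Lagrange dual can be written as

$$\max_{U} \; -\frac{1}{mk} \sum_{i=1}^{m} \sum_{r=1}^{k} \Big[ mkU_{ir} \log(mkU_{ir}) + (1 - mkU_{ir}) \log(1 - mkU_{ir}) \Big] \tag{64}$$
$$\text{s.t.: } \sum_i U_{ir} y_{ir} H_{i:} \le \nu Q_{:r}^\top, \; \forall r; \quad \|Q_{j:}\|_2 \le 1, \; \forall j.$$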



The relationship between $\rho_{ir}$ and $U_{ir}$ is the same as (34). We replace steps 1 and 2 in Algorithm 2 with the constraint in (64) and step 4 in Algorithm 2 with the optimization problem in (63). As in Chapelle and Keerthi (2008), it is easy to apply the above formulation to multi-label classification, where each example can have multiple class labels. We leave this for future work.

Parallel optimization for fast boosting. The computational bottleneck of Algorithm 3 lies in computing $W^{s+1}$. By simplifying the margin to $\rho_{ir} = y_{ir} H_{i:} w_r$, we can solve for each $w_r, \forall r$, independently. This speeds up our training time by a factor proportional to the number of classes. Let us define $W = [w_1, w_2, \ldots, w_k] \in \mathbb{R}^{n \times k}$, $Z = [z_1, z_2, \ldots, z_k] \in \mathbb{R}^{n \times k}$ and $U = [u_1, u_2, \ldots, u_k] \in \mathbb{R}^{n \times k}$. Line 2 in Algorithm 3 can then simply be replaced by,

$$w_r^{s+1} = \operatorname*{argmin}_{w} \; \frac{1}{mk} \sum_{i=1}^{m} \sum_{r=1}^{k} \log\left( 1 + \exp(-\rho_{ir}) \right) + (u_r^s)^\top w + \frac{\theta}{2} \|w - z_r^s\|_2^2, \; \forall r. \tag{65}$$

Even without a multi-core processor, solving a series of problems (65) is still faster than solving line 2 in Algorithm 3. Distributed optimization can also be applied to our algorithms to further speed up the training time. The idea is to distribute a subset of the training data in (65) to each processor and gather the optimal $w_r^{s+1}$ to form the average. Interested readers should refer to Chapter 8 in Boyd et al. (2011).
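The reason the simplified margin allows per-class training can be seen in a short sketch: with $\rho_{ir} = y_{ir} H_{i:} w_r$, the loss is a sum of terms that each depend on a single column $w_r$. The code below assumes $H$ is a list of rows and the columns of $W$ are given as a list of vectors; the function names are hypothetical.

```python
import math


def label_coding(labels, k):
    """y_ir = +1 if sample i belongs to class r, and -1 otherwise."""
    return [[1 if y == r else -1 for r in range(k)] for y in labels]


def class_loss(H, w_r, y_col):
    """Unnormalized logistic loss of one class; depends on w_r only."""
    return sum(
        math.log(1.0 + math.exp(-y * sum(h * w for h, w in zip(H[i], w_r))))
        for i, y in enumerate(y_col)
    )


def fast_loss(H, W_cols, labels):
    """Loss term of (63), written as a sum of independent per-class losses."""
    m, k = len(H), len(W_cols)
    Y = label_coding(labels, k)
    per_class = [class_loss(H, W_cols[r], [Y[i][r] for i in range(m)]) for r in range(k)]
    return sum(per_class) / (m * k)


H = [[1, -1], [-1, 1]]
labels = [0, 2]
zero_loss = fast_loss(H, [[0.0, 0.0]] * 3, labels)
```

Each call to `class_loss` could run on a separate worker, since no term couples two different columns of $W$; only the regularizer ties the classes together, and ADMM handles that in the $Z$-update.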



3. Experiments 
3.1 MultiBoost$^{\ell_1}$

We first performed a few sets of experiments to compare MultiBoost with previous multi-class boosting algorithms. For a fair comparison, we focus on the multi-class algorithms using binary weak learners, including AdaBoost.MO and AdaBoost.ECC, which are still considered the state of the art. For AdaBoost.MO, error-correcting output codes are introduced to reduce the primal problem into multiple binary ones; for AdaBoost.ECC, the binary partitioning is made at each iteration by using the "random-half" method, which has been experimentally shown to be better than the optimal "max-cut" solution (Li, 2006). Decision stumps are chosen as the weak classifiers for all boosting algorithms, due to their simplicity and the controlled complexity of the weak learner.
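For readers unfamiliar with decision stumps, a minimal sketch follows. It assumes a stump that thresholds a single feature and outputs a label in $\{-1, +1\}$; the parameter names `feature`, `threshold` and `polarity` are illustrative and are not taken from the paper.

```python
def stump_predict(x, feature, threshold, polarity=1):
    """Return +polarity if x[feature] > threshold, and -polarity otherwise."""
    return polarity if x[feature] > threshold else -polarity


def stump_outputs(X, feature, threshold, polarity=1):
    """Evaluate one stump on every sample; this gives one column of H."""
    return [stump_predict(x, feature, threshold, polarity) for x in X]


X = [[0.2, 5.0], [0.9, 1.0]]
col_a = stump_outputs(X, feature=0, threshold=0.5)
col_b = stump_outputs(X, feature=0, threshold=0.5, polarity=-1)
```

Because each stump reads exactly one feature, selecting a stump is also a feature selection step, which is why the learned models can be visualized per feature.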
[Figure 1 panels: (a) mean images of "1", "6" and "9"; (b) AdaBoost.ECC; (c) MultiBoost-hinge; (d) MultiBoost-exp]

Figure 1: Plot (a) shows the mean images of the samples belonging to digits "1", "6" and "9". Each block is a feature and is numerically indexed. The remaining plots illustrate the classification models trained on this data set by (b) AdaBoost.ECC, (c) MultiBoost-hinge and (d) MultiBoost-exp. Red circles indicate that weak classifiers on these features should take large values; green crosses indicate that small values should be taken on these features in order to make a correct classification. The width of a mark is proportional to the weight of the stump. We can see that MultiBoost-hinge is slightly better than AdaBoost.ECC, e.g., on the 43rd and 21st features.



Convex optimization problems are involved in MultiBoost-hinge and MultiBoost-exp. To solve them, we use the off-the-shelf Mosek convex optimization package, which provides solutions for both primal and dual problems simultaneously with its interior-point Newton method. We also need to set the regularization parameter $\nu$ for these two algorithms using cross validation. For each run, a five-fold cross validation is carried out first to determine the best $\nu$. Because the loss functions in MultiBoost-hinge and MultiBoost-exp may
have different scales, we choose the parameter from {10"'', 10-^ 5 • 10'^, 0.01, 0.02, 0.04, 
0.05} for the former, and the candidate pool {10"^, 10"^, 5 • lO"'^, 8 • lO"'^, 10"^, 2 • 10"^, 
4 • 10-^ 8 • 10-6, 10-5} for the latter. 

Toy data. In the first experiment, we make the comparison on a toy data set, which consists of 4 clusters of planar points. Each cluster has 50 samples, which are drawn from their respective normal distributions. As shown in Figure 2(a), the centers of the circles indicate the means, and the radii depict the different deviations. We run the boosting algorithms on this toy data set and plot the decision boundaries on the x-y plane. Figures 2(b)-(e) illustrate the results when the number of training iterations is set to 100. In this case, it is hard to state which model is better. However, if we increase the number of iterations to 5000, the planes in (f) and (g) are apparently over-segmented by AdaBoost.MO and AdaBoost.ECC. In contrast, the decision boundaries of (h) MultiBoost-hinge and (i) MultiBoost-exp seem closer to the true decision boundary. Unlike the others, models trained by AdaBoost.MO are more complex, since this learning method assembles $\ell$ weak classifiers rather than one at each iteration if $\ell$-length codewords are used. Empirically we see that AdaBoost.ECC also seems susceptible to over-fitting.

UCI data sets. Next we test our algorithms on 7 data sets collected from the UCI repository. Samples are randomly divided into 75% for training and 25% for testing, regardless of whether there is a pre-specified split. Each data set is run 10 times and the average test errors are reported in Table 1. The maximum number of iterations is set to 500. Almost all the algorithms converge before the maximum iteration. Again the regularization parameter is determined by 5-fold cross validation.

The conclusions that we can draw from this experiment are: 1) Overall, all the algorithms achieve comparable accuracy. 2) Our algorithms are slightly better in terms of generalization ability than the other two on 5 out of 7 data sets. MultiBoost-exp outperforms the others on 4 data sets. 3) The performance of MultiBoost-hinge is more stable than that of MultiBoost-exp, which may be due to the fact that the hinge loss is less sensitive to noise than the exponential loss.



dataset      AdaBoost.MO    AdaBoost.ECC   MultiBoost-hinge   MultiBoost-exp
thyroid      0.005±0.001    0.005±0.001    0.005±0.001        0.004±0.001
dna          0.059±0.005    0.064±0.005    0.057±0.007        0.061±0.004
wine         0.036±0.025    0.034±0.029    0.032±0.018        0.030±0.029
iris         0.062±0.017    0.073±0.021    0.068±0.022        0.057±0.022
glass        0.232±0.047    0.242±0.053    0.234±0.046        0.315±0.086
svmguide2    0.213±0.039    0.214±0.030    0.222±0.052        0.206±0.040
svmguide4    0.192±0.018    0.191±0.018    0.207±0.018        0.214±0.027

Table 1: Test errors of four boosting algorithms on UCI data sets. The average results of 10 repeated tests are reported. Weak classifiers are decision stumps. MultiBoost-exp is the best on 4 out of 7 data sets.
[Figure 2 panels: (a) data; (b) AdaBoost.MO, 100 iterations; (c) AdaBoost.ECC, 100 iterations; (d) MultiBoost-hinge, 100 iterations; (e) MultiBoost-exp, 100 iterations; (f) AdaBoost.MO, 5000 iterations; (g) AdaBoost.ECC, 5000 iterations; (h) MultiBoost-hinge, 5000 iterations; (i) MultiBoost-exp, 5000 iterations]

Figure 2: Figure (a) shows a toy data set, which contains 4 classes and a total of 200 sample points. Boosting algorithms are trained on this set using decision stumps. Plots (b)-(e) illustrate the decision boundaries made by (b) AdaBoost.MO, (c) AdaBoost.ECC, (d) MultiBoost-hinge and (e) MultiBoost-exp with the number of training iterations being 100. For comparison, plots (f)-(i) illustrate the decision boundaries of these algorithms, respectively, when the number of iterations is 5000. (f) and (g) apparently suffer from over-fitting.
[Figure 3 plot: "5 categories of images in Caltech-256"; horizontal axis: number of iterations]

Figure 3: Test accuracy curves of four boosting algorithms on 5 categories of Caltech-256 images. The weak classifiers are decision stumps and the number of training iterations is 500. The average results of 10 runs are reported. Each run randomly selects 75% of the data for training and the other 25% for testing.



Handwritten digit recognition. To further examine the effectiveness of our algorithms, we have conducted another experiment on a handwritten digits data set, which is also from the UCI repository. The original data set contains 5620 digits written by a total of 43 people on 32 × 32 bitmaps. The bitmaps are then divided into 4 × 4 non-overlapping blocks, and an 8 × 8 descriptor is generated by calculating the sum of 0-1 pixels in each block. For ease of exposition, only the 3 distinct digits "1", "6" and "9" are chosen for classification. Figure 1(a) illustrates the mean images of the training examples of the three digits. The index of each block (feature) is also printed on Figure 1(a) for the convenience of exposition.
We train multi-class boosting on this data set. The maximum number of training iterations is set to 500. 75% of the data are used for training, and the rest for testing. Again 5-fold cross validation is used. We still use decision stumps as the weak classifiers. Boosting with decision stumps implies that we select features at the same time. In other words, decision stumps select the most discriminative blocks for classifying these digits. The four compared algorithms have similar performance on this test, with nearly 98% test accuracy. We plot the models of AdaBoost.ECC, MultiBoost-hinge and MultiBoost-exp in Figures 1(b)-(d). AdaBoost.MO can hardly be illustrated as it involves a multi-dimensional coding scheme. Since a decision stump divides the value range of the feature into two parts, on which there are necessarily two different attributions, we use red circles and green crosses to represent the positive and negative parts. For example, if a decision stump on the 10th feature is $x_{10} > \tau$ and assigns a set of weights {0.5, 0.2, 0.8} to the three labels, we mark the 10th block in the third digit image with a red circle, and the 10th block in the second digit with a green cross; if the stump is $x_{10} < \tau$ with the same weights, we make the opposite marks. In other words, red circles indicate that the decision stumps should take larger values on these blocks, while green crosses indicate that these classifiers should take values as small as possible. The width of a mark stands for the minimal margin defined in Equation (10), that is, in the $i$-th digit, the width is proportional to $h(x) w_{y_i} - \max_r \{ h(x) w_r \}, \forall r \ne y_i$. Some features may be selected multiple times, which divides the value range into several segments. In this case, we neglect all the middle parts.

Figure 4: ... row) images by MultiBoost-hinge. The categories are "cartman", "headphones", "iris", "paperclip" and "skunk". The accuracy of this test is 71.2%. No image is falsely classified into the category of "paperclip".

Clearly, the feature selection results of all three algorithms make sense. Most discriminative features are tagged with circles or crosses. Some blocks that contain significant information on luminance are tagged with thick marks, such as the 22nd and 43rd features in digit "6", and the 22nd and 11th in "9". Taking a close look at the figure, we find that MultiBoost-hinge is slightly better than AdaBoost.ECC. For example, on the 43rd feature the green cross should be marked on digit "9" instead of "1". Also in "1", the 21st feature should be tagged with a relatively thicker circle. However, MultiBoost-exp's results are not as meaningful as those of MultiBoost-hinge.

Object recognition on a subset of Caltech-256. Finally, we test our algorithms on the Caltech-256 data set, which is one of the most popular multi-class benchmarks. We randomly select 5 categories of images. 75% of them are randomly selected for training and the other 25% for testing. A descriptor of 1000 dimensions is used, which combines quantized color and texture local invariant features (also called visterms (Quelhas and Odobez, 2006)). The maximum number of iterations is still set to 500. The averaged test accuracies of 10 runs are reported in Figure 3. Again, we use the simplest decision stumps as weak classifiers. We can see that all four boosting algorithms perform similarly, except that MultiBoost-exp performs worse than the other three. This may be due to the fact that we have not fine-tuned the cross validation parameter. We show some images that are correctly classified and falsely classified by MultiBoost-hinge in Figure 4.



3.2 MultiBoost$^{\text{group}}$

Next we evaluate our mixed-norm regularized boosting algorithms. We mainly use the $\ell_{1,2}$ regularization since $\ell_{1,\infty}$ delivers similar performance. In order to ensure a fair comparison, we evaluate the performance of the proposed algorithms against the other multi-class boosting algorithms evaluated previously, along with AdaBoost-SIP (Zhang et al., 2009), JointBoost (Torralba et al., 2007) and GradBoost ($\ell_1/\ell_2$-regularized) (Duchi and Singer, 2009). Note that the last three also try to share features across classes.
# feat.   Ada.ECC     Ada.MH      JointBoost   MultiBoost$^{\ell_1}$   MultiBoost$^{\text{group}}$
20        0.62/0.68   0.48/0.53   0.71/0.71    0.10/0.14               0.10/0.14
100       0.23/0.33   0.17/0.24   0.44/0.50    0.05/0.13               0.03/0.10
500       0.08/0.20   0.09/0.18   0.24/0.38    0.03/0.10               0.02/0.09

Table 2: Training/test errors of a few multi-class boosting methods on the 2D toy data set. The proposed MultiBoost$^{\text{group}}$ with hinge loss performs slightly better than the others. See Figure 5 for an illustration.



Artificial data. We consider the problem of discriminating 6 object classes on a 2D plane. Each sample consists of 2 measurements: orientation and radius. For all classes, the orientation is drawn uniformly between 0 and $2\pi$. The radius of the first group is drawn uniformly between 0 and 1, the radius of the second group between 1 and 2, and so on. We generate 50 samples in the first group, 100 samples in the second group, 150 samples in the third group, and so on. The training set and the test set contain the same number of samples. In this example, the feature vectors are the vertical and horizontal coordinates of the samples. We train 5 different classifiers based on the proposed MultiBoost$^{\text{group}}$ (hinge loss), AdaBoost.MH (Schapire and Singer, 1999), AdaBoost.ECC (Guruswami and Sahai, 1999) and JointBoost (Torralba et al., 2007). The multi-class classifier is composed of a set of



binary decision stumps. For our algorithm, we choose the regularization parameter v from 
{10~^, 10~^, 10"'^, 10~^, 10~^}. For JointBoost, we set the outermost class (maximal radius) 
as background. We evaluate 5 boosting algorithms on this toy data and plot the decision boundaries in Figure 5. Table 2 reports some training and test error rates. Our algorithm performs best amongst the five evaluated classifiers. We conjecture that the poor performance of JointBoost is due to the small number of background samples in the training data. JointBoost was designed for the task of multi-class object detection, where the objective is to detect several classes of objects from background samples. The algorithm might not work well on general multi-class problems. We then repeat our experiment by increasing the number of iterations to 500, and JointBoost, AdaBoost.MH and AdaBoost.ECC still perform poorly on this toy data set compared to our approach.
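The artificial data set described in this experiment can be generated with a short sketch. The random seed, the function name `make_ring_data` and the use of Python's `random` module are assumptions; the paper does not state how the samples were drawn beyond the distributions given above.

```python
import math
import random


def make_ring_data(n_classes=6, base=50, seed=0):
    """Concentric-ring toy data: class c has radius in [c, c + 1] and 50 * (c + 1) samples.

    Features are the horizontal and vertical coordinates of each sample.
    """
    rng = random.Random(seed)
    X, y = [], []
    for c in range(n_classes):
        for _ in range(base * (c + 1)):
            angle = rng.uniform(0.0, 2.0 * math.pi)
            radius = rng.uniform(c, c + 1)
            X.append((radius * math.cos(angle), radius * math.sin(angle)))
            y.append(c)
    return X, y


X_toy, y_toy = make_ring_data()
```

Since the classes are separated by radius while the stumps threshold Cartesian coordinates, many axis-aligned stumps must be combined to approximate each circular boundary, which makes this a demanding test for stump-based boosting.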



UCI data sets. The second experiment is carried out on some UCI machine learning data sets. Since we are more interested in the performance of multi-class algorithms when the number of classes is large, we evaluate our algorithm on 'segment' (7 classes), 'USPS' (10 classes), 'pendigits' (10 classes), 'vowel' (11 classes) and 'isolet' (26 classes). All data instances from 'segment' and 'vowel' are used in our experiment. For USPS, pendigits and isolet we randomly select 100 samples from each class. We use the original attributes for USPS (256 attributes) and isolet (617 attributes). For the rest, we increase the number of attributes by multiplying pairs of attributes. Each data set is then randomly split into two groups: 75% of the samples for training and 25% for evaluation. In this experiment, we compare MultiBoost$^{\text{group}}$ (logistic loss) to AdaBoost.MH (Schapire and Singer, 1999), AdaBoost.ECC (Guruswami and Sahai, 1999) and GradBoost ($\ell_1/\ell_2$-regularized) (Duchi and Singer, 2009). The regularization parameter is first determined by 5-fold cross validation.






[Figure 5 columns, left to right: AdaBoost.ECC, AdaBoost.MH, JointBoost, MultiBoost$^{\ell_1}$, MultiBoost$^{\text{group}}$]

Figure 5: Decision boundaries on a toy data set, with top row: 20 weak classifiers, middle row: 100 weak classifiers, and bottom row: 500 weak classifiers. Note that some multi-class algorithms end up with very complicated and multi-modal decision boundaries.



For GradBoost, we choose the regularization parameter from {10 ^,5-10 ^,10 ^,5 • 
10^3, 10^2^ 5-10-2, 10-1, 5-10-1}. For our algorithm, we choose the regularization parameter 
from {10-^ 5 - IQ-'^, 10"^, 5 - 10'^, 10"^, 5 - 10-^ 10"^, 5 - lO""^, IQ-^}. All experiments are
repeated 10 times using the same regularization parameter. The maximum number of 
boosting iterations is set to 500. We observe that almost all the algorithms converge in fewer
than 500 iterations in this experiment. We plot the mean test error versus the proportion of
features used in Figure 6. These results show that our proposed approach consistently outperforms
its competitors. On the 'segment' and 'vowel' data sets we observe that MultiBoost^{ℓ1}
and MultiBoost^{group} perform similarly. We suspect that this is because the number of
attributes in both data sets is quite small, and thus that there is little advantage to be
gained through feature sharing on these data sets. Our approach often has the fastest
convergence rate (note, however, that GradBoost converges faster on the USPS data set
but ends up with a larger test error).
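The parameter selection step (5-fold cross validation over a grid of candidate regularization values) can be sketched as below. This is a generic sketch, not the authors' procedure in detail: `toy_error` is a hypothetical stand-in for "train a booster on the training folds and measure its error on the held-out fold", and `ILLUSTRATIVE_GRID` is an illustrative grid of the same 1×/5× form, not the grid used in the paper.

```python
import math
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def select_parameter(grid, n_samples, error_fn, k=5, seed=0):
    """Return the grid value with the lowest mean held-out error, and that error."""
    folds = k_fold_indices(n_samples, k, seed)
    best_param, best_error = None, float("inf")
    for param in grid:
        fold_errors = []
        for held_out in folds:
            held_out_set = set(held_out)
            train = [i for i in range(n_samples) if i not in held_out_set]
            fold_errors.append(error_fn(param, train, held_out))
        mean_error = sum(fold_errors) / len(folds)
        if mean_error < best_error:
            best_param, best_error = param, mean_error
    return best_param, best_error

def toy_error(param, train_idx, held_out_idx):
    """Hypothetical error curve standing in for real training; smallest at 1e-2."""
    return abs(math.log10(param) + 2.0)

# Illustrative grid only (assumption), alternating 1x and 5x per decade.
ILLUSTRATIVE_GRID = [1e-3, 5e-3, 1e-2, 5e-2, 1e-1, 5e-1]

best_param, best_error = select_parameter(
    ILLUSTRATIVE_GRID, n_samples=100, error_fn=toy_error
)
```

Once a value has been selected this way, it is held fixed across the repeated runs, matching the statement that all experiments use the same regularization parameter.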

Comparison between GradBoost and our algorithm. GradBoost with mixed-norm
regularization (Duchi and Singer, 2009) is similar to the method presented here. The
distinction, however, is that our method minimizes the original convex loss function rather
than quadratic bounds on this function. The result is that our method is not only more
effective, but also more general, as it can be applied not only to the logistic loss function
but also to any convex loss function. In addition, our approach shares a similar formulation
with standard boosting algorithms, i.e., in the way we generate weak learners and update sample
weights (dual variables in our algorithm). The algorithm of Duchi and Singer (2009) is rather
heuristic and it is not known when the algorithm will converge. Furthermore, GradBoost is
[Figure 6 panels (titles recovered from the plots): SEGMENT (7 classes), SVMGUIDE4 (6 classes),
PENDIGITS (10 classes). Horizontal axis: proportion of used features.]

Figure 6: The performance of our algorithm (MultiBoost^{group}) compared with various
boosting algorithms on several machine learning data sets. The horizontal axis is
the fraction of used features and the vertical axis is the test error rate. We observe
that group sparsity-based approaches (ours and GradBoost) generally converge
faster than other algorithms.



more similar to FloatBoost (Li and Zhang, 2004), where the authors introduce a backward
pruning step to remove less discriminative weak classifiers. The drawbacks of pruning are that
1) it is heuristic and 2) it prolongs the training process.
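To make the distinction between minimizing the original loss and minimizing a quadratic bound concrete, the sketch below compares the logistic loss log(1 + exp(-z)) with a standard quadratic upper bound built from its curvature limit of 1/4. This is a generic textbook bound chosen for illustration; it is an assumption that it resembles, and it is not claimed to be, the exact bound used by Duchi and Singer (2009).

```python
import math

def logistic_loss(z):
    """Numerically stable evaluation of log(1 + exp(-z))."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

def logistic_grad(z):
    """Derivative of the logistic loss with respect to z: -1 / (1 + exp(z))."""
    if z >= 0:
        e = math.exp(-z)
        return -e / (1.0 + e)
    return -1.0 / (1.0 + math.exp(z))

def quadratic_bound(z, z0):
    """Second-order surrogate about z0 using the curvature limit 1/4.

    Because the second derivative of the logistic loss never exceeds 1/4,
    this surrogate lies on or above the loss for every z.
    """
    d = z - z0
    return logistic_loss(z0) + logistic_grad(z0) * d + 0.125 * d * d

# Gap between the surrogate (expanded about z0 = 0) and the true loss on a grid.
z_values = [i / 2.0 for i in range(-10, 11)]
gaps = [quadratic_bound(z, 0.0) - logistic_loss(z) for z in z_values]
```

The gap is zero at the expansion point and grows away from it, which is the sense in which a method that optimizes the bound is solving a looser problem than one that optimizes the loss directly.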

ABCDETC and MNIST handwritten data. The NEC Lab ABCDETC data set consists
of 72 classes (digits, letters and symbols). For this experiment, we only use digits and
letters (10 digits, 26 lower-case and 26 upper-case letters). We first resize the original images to
a resolution of 28 × 28 pixels and apply de-skew pre-processing. We then apply a spatial
pyramid and extract 3 levels of HOG features with 50% block overlap. The block sizes in the
three levels are 4 × 4, 7 × 7 and 14 × 14 pixels, respectively. Extracted HOG features from all
levels are concatenated. In total, there are 2,172 HOG features. For ABCDETC, we randomly
select 5 samples from each class as the training set and 120 samples from each class as the test set.
For MNIST, we randomly select 100 samples from each class as the training set and use the
original test set of 10,000 samples. In this experiment, we also compare the performance
of MultiBoost^{group} with a fast training variant, MultiBoost^{group}_{fast}. All experiments are run
10 times with 500 boosting iterations and the results are briefly summarized in Table 3.
From the table, both MultiBoost^{group} and MultiBoost^{group}_{fast} perform best among the
evaluated algorithms, especially on the ABCDETC test set where the number of classes is
large. We observe that the fast approach performs slightly better than MultiBoost^{group}. In
our work, the advantage of the fast approach compared to MultiBoost^{group} is that the
Method                                  MNIST        ABCDETC
Ada.MH (Schapire and Singer, 1999)      3.0 (0.2)    63.4 (1.8)
Ada.ECC (Guruswami and Sahai, 1999)     3.1 (0.2)    70.5 (1.1)
Ada.SIP (Zhang et al., 2009)            4.4 (1.3)    62.7 (1.2)
GradBoost (Duchi and Singer, 2009)      5.3 (0.3)    73.9 (1.3)
MultiBoost^{ℓ1}                         3.7 (0.2)    73.2 (0.7)
MultiBoost^{group}                      3.1 (0.2)    59.1 (1.1)
MultiBoost^{group}_{fast}               3.0 (0.3)    58.2 (0.9)

Table 3: Test errors (%) of a few multi-class boosting methods on the MNIST and
ABCDETC handwritten data sets. All experiments are run 10 times with 500
boosting iterations. The mean and standard deviation of the test error (in percent)
are reported.



MNIST                        '0 - 3'     '4 - 5'      '6 - 7'      '8 - 10'
MultiBoost^{ℓ1}              99.8%       0.2%         0%           0%
MultiBoost^{group}           4.5%        48.8%        40.9%        5.8%
MultiBoost^{group}_{fast}    10.1%       69.9%        19.7%        0.3%

ABCDETC                      '0 - 15'    '16 - 30'    '31 - 45'    '46 - 62'
MultiBoost^{ℓ1}              99.8%       0.2%         0%           0%
MultiBoost^{group}           0%          81.3%        18.7%        0%
MultiBoost^{group}_{fast}    0%          65.7%        33.5%        0.7%

Table 4: The distribution of shared weak classifiers. For example, '8 - 10' indicates that
the weak classifier is being shared among 8 to 10 classes. The table illustrates the
feature sharing property of our algorithms, i.e., one weak classifier is being shared
among multiple classes.



training time can be further reduced by exploiting parallelism in ADMM, as previously
mentioned. Table 4 illustrates the feature sharing property of our algorithms. We
can clearly see that the group sparsity regularization indeed encourages feature sharing.
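A distribution like the one in Table 4 can be computed from a learned coefficient matrix by counting, for each weak classifier, how many classes give it a non-zero weight. The sketch below assumes a matrix `W` with one row per weak classifier and one column per class; that layout, the tolerance `tol`, the toy values and the function name `sharing_distribution` are assumptions made for illustration.

```python
def sharing_distribution(W, bins, tol=1e-8):
    """Percentage of weak classifiers whose number of using classes falls in each bin.

    W    : list of rows, one per weak classifier, one coefficient per class (assumed layout).
    bins : list of inclusive (low, high) ranges on the number of classes.
    tol  : coefficients with magnitude at or below tol are treated as zero.
    """
    counts = [0] * len(bins)
    for row in W:
        n_classes = sum(1 for w in row if abs(w) > tol)
        for b, (low, high) in enumerate(bins):
            if low <= n_classes <= high:
                counts[b] += 1
                break
    return [100.0 * c / len(W) for c in counts]

# Toy coefficient matrix: 4 weak classifiers, 4 classes (illustrative values only).
W = [
    [0.5, 0.0, 0.0, 0.0],  # used by 1 class
    [0.2, 0.3, 0.0, 0.0],  # shared by 2 classes
    [0.1, 0.1, 0.4, 0.0],  # shared by 3 classes
    [0.3, 0.2, 0.1, 0.6],  # shared by all 4 classes
]
bins = [(0, 1), (2, 3), (4, 4)]
distribution = sharing_distribution(W, bins)
```

Under element-wise ℓ1 regularization most rows would be expected to land in the lowest bin, whereas a row-wise group penalty tends to keep or discard a whole row, pushing mass toward the higher bins.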



Scene recognition. In the next experiment, we evaluate our approach on the 15-scene
data set used in Lazebnik et al. (2006). The set consists of 9 outdoor scenes and 6 indoor
scenes. There are 4,485 images in total. For each run, the available data are randomly
split into a training set and a test set based on published protocols. This is repeated 5
times and the average accuracy is reported. In each train/test split, a visual codebook is
generated using only training images. Both training and test images are then transformed
into histograms of code words. We use CENTRIST of Wu and Rehg (2011) as our feature
descriptor. 200 visual code words are built using the histogram intersection kernel (HIK),
which has been shown to outperform k-means and k-median (Wu and Rehg, 2011). We
represent each image in a spatial hierarchy manner (Bosch et al., 2008). Each image consists
of 31 sub-windows. An image is represented by the concatenation of histograms of code
words from all 31 sub-windows. Hence, each image is represented by a 6,200-dimensional
histogram (31 sub-windows × 200 code words).
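The image representation and the histogram intersection kernel mentioned above can be illustrated with a short sketch. It only shows the kernel value K(x, y) = Σ_i min(x_i, y_i) and the concatenation arithmetic; the codebook construction and the CENTRIST descriptor are not reproduced, and the function names and toy histograms are assumptions.

```python
def histogram_intersection(h1, h2):
    """Histogram intersection kernel: sum of element-wise minima."""
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same length")
    return sum(min(a, b) for a, b in zip(h1, h2))

def concatenate_histograms(sub_window_histograms):
    """Concatenate the per-sub-window histograms into one image descriptor."""
    return [value for hist in sub_window_histograms for value in hist]

NUM_SUB_WINDOWS = 31   # sub-windows per image, as stated in the text
NUM_CODE_WORDS = 200   # visual code words, as stated in the text

# Toy descriptors: every sub-window histogram is constant (illustrative only).
image_a = concatenate_histograms([[1.0] * NUM_CODE_WORDS for _ in range(NUM_SUB_WINDOWS)])
image_b = concatenate_histograms([[0.5] * NUM_CODE_WORDS for _ in range(NUM_SUB_WINDOWS)])

descriptor_length = len(image_a)
similarity = histogram_intersection(image_a, image_b)
```

Each coordinate of the concatenated descriptor can serve as one candidate feature for a weak learner, which is why the boosting methods can be compared by the number of features they end up using.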
Method                                      # features used    Accuracy (%)
SAMME† (Zhu et al., 2009)                   1000               70.9 (0.40)
JointBoost† (Torralba et al., 2007)         1000               72.2 (0.70)
MultiBoost^{ℓ1}                             1000               76.0 (0.48)
AdaBoost.SIP (Zhang et al., 2009)           1000               75.7 (0.10)
AdaBoost.ECC (Guruswami and Sahai, 1999)    1000               76.5 (0.67)
AdaBoost.MH (Schapire and Singer, 1999)     1000               77.6 (0.59)
MultiBoost^{group}                          1000               77.8 (0.77)
MultiBoost^{group}_{fast}                   1000               79.2 (0.82)
Linear SVM                                  6200               76.3 (0.88)
Nonlinear SVM (HIK)                         6200               81.4 (0.60)

Table 5: Recognition rate of various algorithms on the Scene15 data set. All experiments are
run 5 times. The mean and standard deviation of the accuracy (in percent)
are reported. Results marked by † were reported in Zhang et al. (2009).



Figure 7 shows the average classification errors. We observe that both MultiBoost^{group}
and MultiBoost^{ℓ1} converge quickly in the beginning. However, MultiBoost^{group} has a better
overall convergence rate. We also observe that MultiBoost^{group} and MultiBoost^{group}_{fast}
have the lowest test errors among the algorithms evaluated. We also apply a multi-class
SVM to the above data set using the LIBSVM package (Chang and Lin, 2011) and
report the recognition results in Table 5. SVM with 6,200 features achieves an average
accuracy of 76.30% (linear) and 81.47% (non-linear). Our results indicate that both proposed
approaches achieve an accuracy comparable to that of the non-linear SVM while requiring
fewer features (77.8% accuracy for MultiBoost^{group} with 1000 features and 79.2% accuracy for
MultiBoost^{group}_{fast}).



Traffic sign recognition. We evaluate our approach on the recent German traffic sign
recognition benchmark (see footnote 5). The data set consists of 43 classes with more than
50,000 images in total. We randomly select 100 samples from each class to train our classifier.
We use the provided test set (12,569 images) to evaluate the performance of our classifiers. All
training images are scaled to 40 × 40 pixels using bilinear interpolation. Three different types of
pre-computed HOG features are provided (6,052 features in total). We combine all three types.
We also make use of a histogram of hue values (256 bins). Hence, there are 6,308 features
in total. The results of different classifiers are shown in Figure 8. Our proposed
classifier outperforms the other evaluated classifiers. As a baseline, we train a multi-class SVM
using LIBSVM (Chang and Lin, 2011). The SVM achieves 93.05% accuracy (using 6,308 features), while
our classifier achieves 95.62% for MultiBoost^{group} and 95.42% for MultiBoost^{group}_{fast} with a
much smaller set of features (500 features). Note that overfitting behavior is observed
for MultiBoost^{ℓ1}.
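The hue histogram and the feature count above can be illustrated as follows. This is a sketch under assumptions: the text does not say how hue is computed or whether the histogram is normalized, so the example uses the standard library's `colorsys.rgb_to_hsv` and raw counts, and `hue_histogram` is a hypothetical helper name.

```python
import colorsys

def hue_histogram(pixels, num_bins=256):
    """Count pixels per hue bin.

    pixels: iterable of (r, g, b) tuples with channel values in [0, 1].
    Assumption: raw counts are returned; no normalization is applied.
    """
    histogram = [0] * num_bins
    for r, g, b in pixels:
        hue, _saturation, _value = colorsys.rgb_to_hsv(r, g, b)
        # hue lies in [0, 1]; clamp so that hue == 1.0 falls in the last bin.
        index = min(int(hue * num_bins), num_bins - 1)
        histogram[index] += 1
    return histogram

NUM_HOG_FEATURES = 6052  # pre-computed HOG features, as stated in the text
NUM_HUE_BINS = 256       # hue histogram bins, as stated in the text
TOTAL_FEATURES = NUM_HOG_FEATURES + NUM_HUE_BINS

# Toy "image": one red, two green and one blue pixel (illustrative only).
toy_pixels = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 0.0)]
toy_histogram = hue_histogram(toy_pixels, NUM_HUE_BINS)
```

Appending the 256 hue bins to the 6,052 HOG values gives the 6,308-dimensional feature vector from which the boosting methods select their features.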



5. http://benchmark.ini.rub.de/
[Figure 7 plot. Horizontal axis: # features (log scale).]

Figure 7: Performance of different classifiers on the scene recognition data set. We also
report the number of features required to achieve similar results to linear multi-class
SVM. Both of our methods (MultiBoost^{group} and MultiBoost^{group}_{fast}) outperform
other evaluated boosting algorithms.




[Figure 8 plot. Horizontal axis: # features (log scale), with ticks at 1, 10 and 100.]

Figure 8: Performance of different classifiers on the traffic sign recognition data set. We also
report the number of features needed to achieve a similar accuracy to the linear
SVM. Both of our methods outperform other multi-class methods in terms of the
test error.



4. Conclusion 

In this work, we have presented a direct formulation for multi-class boosting. We derive the
Lagrange dual of the formulated primal optimization problem. Based on the dual problem,
we are able to design a fully-corrective boosting algorithm using the column generation
technique. At each iteration, all weak classifiers' weights are updated. We then generalize our
approach and propose a new feature-sharing multi-class boosting method. The proposed boosting
method is based on the primal-dual view of the group sparsity regularized optimization.
Experiments on several data sets demonstrate that our direct multi-class boosting
achieves competitive test accuracy compared with other existing multi-class boosting methods.

Future research topics include how to efficiently solve the convex optimization problems
of the proposed multi-class boosting. Conventional multi-class boosting methods do not need to
solve a convex optimization problem at each step and are thus much faster. We also want to explore
the possibility of structural learning with boosting by extending the proposed multi-class
boosting framework.